2024-06-26T06:58:35.5115209Z Current runner version: '2.317.0' 2024-06-26T06:58:35.5121674Z Runner name: 'i-089977a14ceff415f' 2024-06-26T06:58:35.5122502Z Runner group name: 'Default' 2024-06-26T06:58:35.5123504Z Machine name: 'ip-10-0-39-17' 2024-06-26T06:58:35.5128643Z ##[group]GITHUB_TOKEN Permissions 2024-06-26T06:58:35.5130696Z Actions: read 2024-06-26T06:58:35.5131260Z Attestations: read 2024-06-26T06:58:35.5131827Z Checks: read 2024-06-26T06:58:35.5132368Z Contents: read 2024-06-26T06:58:35.5132907Z Deployments: read 2024-06-26T06:58:35.5133467Z Discussions: read 2024-06-26T06:58:35.5134060Z Issues: read 2024-06-26T06:58:35.5134579Z Metadata: read 2024-06-26T06:58:35.5135112Z Packages: read 2024-06-26T06:58:35.5135641Z Pages: read 2024-06-26T06:58:35.5136175Z PullRequests: read 2024-06-26T06:58:35.5136806Z RepositoryProjects: read 2024-06-26T06:58:35.5137442Z SecurityEvents: read 2024-06-26T06:58:35.5138018Z Statuses: read 2024-06-26T06:58:35.5138538Z ##[endgroup] 2024-06-26T06:58:35.5141738Z Secret source: Actions 2024-06-26T06:58:35.5142621Z Prepare workflow directory 2024-06-26T06:58:35.8664691Z Prepare all required actions 2024-06-26T06:58:35.8854009Z Getting action download info 2024-06-26T06:58:36.0503972Z Download action repository 'pytorch/test-infra@main' (SHA:43a2ce341cc31288e9a38b65ce600a7f43021bd5) 2024-06-26T06:58:36.3472988Z Download action repository 'pytorch/pytorch@main' (SHA:c422a9549d2b320b47a9391f6cbcf592b2876c88) 2024-06-26T06:58:39.1115650Z Download action repository 'aws-actions/configure-aws-credentials@v3' (SHA:50ac8dd1e1b10d09dac7b8727528b91bed831ac0) 2024-06-26T06:58:39.2253609Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2024-06-26T06:58:39.5202060Z Getting action download info 2024-06-26T06:58:39.6146719Z Download action repository 'malfet/checkout@silent-checkout' (SHA:e07af140b3ccefc05679e3755b9db68f4ee4589c) 2024-06-26T06:58:39.8512396Z Getting action download info 2024-06-26T06:58:39.9477497Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2024-06-26T06:58:40.0924155Z Uses: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/pull/129470/merge (4b51b1a62a63a1add1b3a0f7882f5c7dc66b8f8d) 2024-06-26T06:58:40.0926594Z ##[group] Inputs 2024-06-26T06:58:40.0927064Z build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T06:58:40.0929442Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]} 2024-06-26T06:58:40.0932324Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T06:58:40.0933394Z sync-tag: 2024-06-26T06:58:40.0934179Z timeout-minutes: 240 2024-06-26T06:58:40.0934510Z use-gha: 2024-06-26T06:58:40.0934797Z dashboard-tag: 2024-06-26T06:58:40.0935123Z s3-bucket: gha-artifacts 2024-06-26T06:58:40.0935470Z aws-role-to-assume: 2024-06-26T06:58:40.0935803Z ##[endgroup] 2024-06-26T06:58:40.0936690Z Complete job name: linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T06:58:40.1494445Z A job started hook has been configured by the self-hosted runner administrator 2024-06-26T06:58:40.1634187Z ##[group]Run '/home/ec2-user/runner-scripts/cleanup.sh' 2024-06-26T06:58:40.1644473Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T06:58:40.1645151Z ##[endgroup] 2024-06-26T06:58:40.7695072Z ##[group]Run pytorch/test-infra/.github/actions/setup-ssh@main 2024-06-26T06:58:40.7695700Z with: 2024-06-26T06:58:40.7696470Z github-secret: *** 2024-06-26T06:58:40.7697647Z instructions: All testing is done inside the container, to start an interactive session run: docker exec -it $(docker container ps --format '{{.ID}}') bash 2024-06-26T06:58:40.7698730Z activate-with-label: false 2024-06-26T06:58:40.7699161Z label: with-ssh 2024-06-26T06:58:40.7699567Z remove-existing-keys: true 2024-06-26T06:58:40.7700043Z fail-silently: true 2024-06-26T06:58:40.7700426Z env: 2024-06-26T06:58:40.7700777Z GIT_DEFAULT_BRANCH: main 2024-06-26T06:58:40.7701190Z ##[endgroup] 2024-06-26T06:58:40.8576178Z Please see https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions for more info. 2024-06-26T06:58:41.1080216Z Grabbing public ssh keys from https://github.com/leslie-fang-intel.keys 2024-06-26T06:58:41.1839880Z No SSH keys found for user leslie-fang-intel 2024-06-26T06:58:41.1840787Z Grabbing public ssh keys from https://github.com/leslie-fang-intel.keys 2024-06-26T06:58:41.2546107Z No SSH keys found for user leslie-fang-intel 2024-06-26T06:58:41.2653723Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main 2024-06-26T06:58:41.2654273Z with: 2024-06-26T06:58:41.2654547Z submodules: recursive 2024-06-26T06:58:41.2654881Z fetch-depth: 0 2024-06-26T06:58:41.2655161Z env: 2024-06-26T06:58:41.2655423Z GIT_DEFAULT_BRANCH: main 2024-06-26T06:58:41.2655760Z ##[endgroup] 2024-06-26T06:58:41.2850803Z ##[group]Run retry () { 2024-06-26T06:58:41.2851253Z retry () { 2024-06-26T06:58:41.2851822Z  $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*) 2024-06-26T06:58:41.2852458Z } 2024-06-26T06:58:41.2852799Z echo "${GITHUB_WORKSPACE}" 2024-06-26T06:58:41.2853285Z if [ -z "${NO_SUDO}" ]; then 2024-06-26T06:58:41.2853838Z  retry sudo rm -rf "${GITHUB_WORKSPACE}" 2024-06-26T06:58:41.2854321Z else 2024-06-26T06:58:41.2854721Z  retry rm -rf "${GITHUB_WORKSPACE}" 2024-06-26T06:58:41.2855219Z fi 2024-06-26T06:58:41.2855647Z mkdir "${GITHUB_WORKSPACE}" 2024-06-26T06:58:41.2863742Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T06:58:41.2864271Z env: 2024-06-26T06:58:41.2864665Z GIT_DEFAULT_BRANCH: main 2024-06-26T06:58:41.2865054Z NO_SUDO: 2024-06-26T06:58:41.2865426Z ##[endgroup] 2024-06-26T06:58:41.2887965Z /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T06:58:43.5194048Z ##[group]Run malfet/checkout@silent-checkout 2024-06-26T06:58:43.5194552Z with: 2024-06-26T06:58:43.5194948Z ref: b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T06:58:43.5195460Z fetch-depth: 0 2024-06-26T06:58:43.5195837Z submodules: recursive 2024-06-26T06:58:43.5196248Z quiet-checkout: true 2024-06-26T06:58:43.5196675Z repository: pytorch/pytorch 2024-06-26T06:58:43.5197230Z token: *** 2024-06-26T06:58:43.5197596Z ssh-strict: true 2024-06-26T06:58:43.5198001Z persist-credentials: true 2024-06-26T06:58:43.5198422Z clean: true 2024-06-26T06:58:43.5198837Z sparse-checkout-cone-mode: true 2024-06-26T06:58:43.5199309Z lfs: false 2024-06-26T06:58:43.5199680Z set-safe-directory: true 2024-06-26T06:58:43.5200093Z env: 2024-06-26T06:58:43.5200443Z GIT_DEFAULT_BRANCH: main 2024-06-26T06:58:43.5200843Z ##[endgroup] 2024-06-26T06:58:43.6130574Z Syncing repository: pytorch/pytorch 2024-06-26T06:58:43.6132267Z ##[group]Getting Git version info 2024-06-26T06:58:43.6133152Z Working directory is '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2024-06-26T06:58:43.6134182Z [command]/usr/bin/git version 2024-06-26T06:58:43.6134790Z git version 2.40.1 2024-06-26T06:58:43.6136546Z ##[endgroup] 2024-06-26T06:58:43.6148385Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/07b8b0b4-0ea0-46e3-aaf9-6760ca383b10' before making global git config changes 2024-06-26T06:58:43.6149801Z Adding repository directory to the temporary git global config as a safe directory 2024-06-26T06:58:43.6151065Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T06:58:43.6178561Z Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch' 2024-06-26T06:58:43.6182324Z ##[group]Initializing the repository 2024-06-26T06:58:43.6185249Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T06:58:43.6204459Z hint: Using 'master' as the name for the initial branch. This default branch name 2024-06-26T06:58:43.6205596Z hint: is subject to change. To configure the initial branch name to use in all 2024-06-26T06:58:43.6206463Z hint: of your new repositories, which will suppress this warning, call: 2024-06-26T06:58:43.6207061Z hint: 2024-06-26T06:58:43.6207714Z hint: git config --global init.defaultBranch 2024-06-26T06:58:43.6208221Z hint: 2024-06-26T06:58:43.6208742Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2024-06-26T06:58:43.6209634Z hint: 'development'. The just-created branch can be renamed via this command: 2024-06-26T06:58:43.6210581Z hint: 2024-06-26T06:58:43.6210902Z hint: git branch -m 2024-06-26T06:58:43.6211674Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/ 2024-06-26T06:58:43.6213502Z [command]/usr/bin/git remote add origin https://github.com/pytorch/pytorch 2024-06-26T06:58:43.6237687Z ##[endgroup] 2024-06-26T06:58:43.6238280Z ##[group]Disabling automatic garbage collection 2024-06-26T06:58:43.6240471Z [command]/usr/bin/git config --local gc.auto 0 2024-06-26T06:58:43.6264241Z ##[endgroup] 2024-06-26T06:58:43.6264761Z ##[group]Setting up auth 2024-06-26T06:58:43.6270494Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2024-06-26T06:58:43.6297315Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2024-06-26T06:58:43.6498592Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2024-06-26T06:58:43.6519979Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2024-06-26T06:58:43.6712560Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2024-06-26T06:58:43.6747887Z ##[endgroup] 2024-06-26T06:58:43.6748563Z ##[group]Fetching the repository 2024-06-26T06:58:43.6754221Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --progress --no-recurse-submodules --quiet origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2024-06-26T06:58:45.8297757Z remote: Enumerating objects: 987392 2024-06-26T06:58:45.8298483Z remote: Enumerating objects: 990209, done. 2024-06-26T06:58:45.8299792Z remote: Counting objects: 0% (1/2817) 2024-06-26T06:58:45.8300658Z remote: Counting objects: 1% (29/2817) 2024-06-26T06:58:45.8301559Z remote: Counting objects: 2% (57/2817) 2024-06-26T06:58:45.8302456Z remote: Counting objects: 3% (85/2817) 2024-06-26T06:58:45.8303132Z remote: Counting objects: 4% (113/2817) 2024-06-26T06:58:45.8303738Z remote: Counting objects: 5% (141/2817) 2024-06-26T06:58:45.8304326Z remote: Counting objects: 6% (170/2817) 2024-06-26T06:58:45.8304922Z remote: Counting objects: 7% (198/2817) 2024-06-26T06:58:45.8305715Z remote: Counting objects: 8% (226/2817) 2024-06-26T06:58:45.8306315Z remote: Counting objects: 9% (254/2817) 2024-06-26T06:58:45.8307038Z remote: Counting objects: 10% (282/2817) 2024-06-26T06:58:45.8307734Z remote: Counting objects: 11% (310/2817) 2024-06-26T06:58:45.8308329Z remote: Counting objects: 12% (339/2817) 2024-06-26T06:58:45.8309107Z remote: Counting objects: 13% (367/2817) 2024-06-26T06:58:45.8309837Z remote: Counting objects: 14% (395/2817) 2024-06-26T06:58:45.8310744Z remote: Counting objects: 15% (423/2817) 2024-06-26T06:58:45.8311331Z remote: Counting objects: 16% (451/2817) 2024-06-26T06:58:45.8311937Z remote: Counting objects: 17% (479/2817) 2024-06-26T06:58:45.8312715Z remote: Counting objects: 18% (508/2817) 2024-06-26T06:58:45.8313430Z remote: Counting objects: 19% (536/2817) 2024-06-26T06:58:45.8314028Z remote: Counting objects: 20% (564/2817) 2024-06-26T06:58:45.8314623Z remote: Counting objects: 21% (592/2817) 2024-06-26T06:58:45.8315213Z remote: Counting objects: 22% (620/2817) 2024-06-26T06:58:45.8315814Z remote: Counting objects: 23% (648/2817) 2024-06-26T06:58:45.8316406Z remote: Counting objects: 24% (677/2817) 2024-06-26T06:58:45.8317146Z remote: Counting objects: 25% (705/2817) 2024-06-26T06:58:45.8317931Z remote: Counting objects: 26% (733/2817) 2024-06-26T06:58:45.8318535Z remote: Counting objects: 27% (761/2817) 2024-06-26T06:58:45.8319276Z remote: Counting objects: 28% (789/2817) 2024-06-26T06:58:45.8319904Z remote: Counting objects: 29% (817/2817) 2024-06-26T06:58:45.8320682Z remote: Counting objects: 30% (846/2817) 2024-06-26T06:58:45.8321288Z remote: Counting objects: 31% (874/2817) 2024-06-26T06:58:45.8322025Z remote: Counting objects: 32% (902/2817) 2024-06-26T06:58:45.8322667Z remote: Counting objects: 33% (930/2817) 2024-06-26T06:58:45.8323266Z remote: Counting objects: 34% (958/2817) 2024-06-26T06:58:45.8323908Z remote: Counting objects: 35% (986/2817) 2024-06-26T06:58:45.8324517Z remote: Counting objects: 36% (1015/2817) 2024-06-26T06:58:45.8325335Z remote: Counting objects: 37% (1043/2817) 2024-06-26T06:58:45.8326071Z remote: Counting objects: 38% (1071/2817) 2024-06-26T06:58:45.8326806Z remote: Counting objects: 39% (1099/2817) 2024-06-26T06:58:45.8327510Z remote: Counting objects: 40% (1127/2817) 2024-06-26T06:58:45.8328323Z remote: Counting objects: 41% (1155/2817) 2024-06-26T06:58:45.8329261Z remote: Counting objects: 42% (1184/2817) 2024-06-26T06:58:45.8329911Z remote: Counting objects: 43% (1212/2817) 2024-06-26T06:58:45.8330527Z remote: Counting objects: 44% (1240/2817) 2024-06-26T06:58:45.8331285Z remote: Counting objects: 45% (1268/2817) 2024-06-26T06:58:45.8332211Z remote: Counting objects: 46% (1296/2817) 2024-06-26T06:58:45.8332845Z remote: Counting objects: 47% (1324/2817) 2024-06-26T06:58:45.8333451Z remote: Counting objects: 48% (1353/2817) 2024-06-26T06:58:45.8334256Z remote: Counting objects: 49% (1381/2817) 2024-06-26T06:58:45.8334964Z remote: Counting objects: 50% (1409/2817) 2024-06-26T06:58:45.8335769Z remote: Counting objects: 51% (1437/2817) 2024-06-26T06:58:45.8336582Z remote: Counting objects: 52% (1465/2817) 2024-06-26T06:58:45.8337421Z remote: Counting objects: 53% (1494/2817) 2024-06-26T06:58:45.8338242Z remote: Counting objects: 54% (1522/2817) 2024-06-26T06:58:45.8338886Z remote: Counting objects: 55% (1550/2817) 2024-06-26T06:58:45.8339699Z remote: Counting objects: 56% (1578/2817) 2024-06-26T06:58:45.8340445Z remote: Counting objects: 57% (1606/2817) 2024-06-26T06:58:45.8341219Z remote: Counting objects: 58% (1634/2817) 2024-06-26T06:58:45.8342037Z remote: Counting objects: 59% (1663/2817) 2024-06-26T06:58:45.8342733Z remote: Counting objects: 60% (1691/2817) 2024-06-26T06:58:45.8343442Z remote: Counting objects: 61% (1719/2817) 2024-06-26T06:58:45.8344220Z remote: Counting objects: 62% (1747/2817) 2024-06-26T06:58:45.8344923Z remote: Counting objects: 63% (1775/2817) 2024-06-26T06:58:45.8345833Z remote: Counting objects: 64% (1803/2817) 2024-06-26T06:58:45.8346561Z remote: Counting objects: 65% (1832/2817) 2024-06-26T06:58:45.8347324Z remote: Counting objects: 66% (1860/2817) 2024-06-26T06:58:45.8348095Z remote: Counting objects: 67% (1888/2817) 2024-06-26T06:58:45.8348761Z remote: Counting objects: 68% (1916/2817) 2024-06-26T06:58:45.8349413Z remote: Counting objects: 69% (1944/2817) 2024-06-26T06:58:45.8350044Z remote: Counting objects: 70% (1972/2817) 2024-06-26T06:58:45.8350647Z remote: Counting objects: 71% (2001/2817) 2024-06-26T06:58:45.8351421Z remote: Counting objects: 72% (2029/2817) 2024-06-26T06:58:45.8352094Z remote: Counting objects: 73% (2057/2817) 2024-06-26T06:58:45.8352856Z remote: Counting objects: 74% (2085/2817) 2024-06-26T06:58:45.8353523Z remote: Counting objects: 75% (2113/2817) 2024-06-26T06:58:45.8354130Z remote: Counting objects: 76% (2141/2817) 2024-06-26T06:58:45.8354730Z remote: Counting objects: 77% (2170/2817) 2024-06-26T06:58:45.8355324Z remote: Counting objects: 78% (2198/2817) 2024-06-26T06:58:45.8356134Z remote: Counting objects: 79% (2226/2817) 2024-06-26T06:58:45.8356963Z remote: Counting objects: 80% (2254/2817) 2024-06-26T06:58:45.8357762Z remote: Counting objects: 81% (2282/2817) 2024-06-26T06:58:45.8358565Z remote: Counting objects: 82% (2310/2817) 2024-06-26T06:58:45.8359367Z remote: Counting objects: 83% (2339/2817) 2024-06-26T06:58:45.8360159Z remote: Counting objects: 84% (2367/2817) 2024-06-26T06:58:45.8360792Z remote: Counting objects: 85% (2395/2817) 2024-06-26T06:58:45.8361509Z remote: Counting objects: 86% (2423/2817) 2024-06-26T06:58:45.8362259Z remote: Counting objects: 87% (2451/2817) 2024-06-26T06:58:45.8362882Z remote: Counting objects: 88% (2479/2817) 2024-06-26T06:58:45.8363526Z remote: Counting objects: 89% (2508/2817) 2024-06-26T06:58:45.8364168Z remote: Counting objects: 90% (2536/2817) 2024-06-26T06:58:45.8364925Z remote: Counting objects: 91% (2564/2817) 2024-06-26T06:58:45.8365565Z remote: Counting objects: 92% (2592/2817) 2024-06-26T06:58:45.8366255Z remote: Counting objects: 93% (2620/2817) 2024-06-26T06:58:45.8367015Z remote: Counting objects: 94% (2648/2817) 2024-06-26T06:58:45.8367626Z remote: Counting objects: 95% (2677/2817) 2024-06-26T06:58:45.8368232Z remote: Counting objects: 96% (2705/2817) 2024-06-26T06:58:45.8368847Z remote: Counting objects: 97% (2733/2817) 2024-06-26T06:58:45.8369458Z remote: Counting objects: 98% (2761/2817) 2024-06-26T06:58:45.8370070Z remote: Counting objects: 99% (2789/2817) 2024-06-26T06:58:45.8370673Z remote: Counting objects: 100% (2817/2817) 2024-06-26T06:58:45.8371318Z remote: Counting objects: 100% (2817/2817), done. 2024-06-26T06:58:45.8530339Z remote: Compressing objects: 0% (1/1140) 2024-06-26T06:58:45.9100920Z remote: Compressing objects: 1% (12/1140) 2024-06-26T06:58:45.9520769Z remote: Compressing objects: 2% (23/1140) 2024-06-26T06:58:45.9852744Z remote: Compressing objects: 3% (35/1140) 2024-06-26T06:58:46.0042702Z remote: Compressing objects: 4% (46/1140) 2024-06-26T06:58:46.0327112Z remote: Compressing objects: 5% (57/1140) 2024-06-26T06:58:46.0542587Z remote: Compressing objects: 6% (69/1140) 2024-06-26T06:58:46.1483698Z remote: Compressing objects: 7% (80/1140) 2024-06-26T06:58:46.2092937Z remote: Compressing objects: 8% (92/1140) 2024-06-26T06:58:46.2560842Z remote: Compressing objects: 9% (103/1140) 2024-06-26T06:58:46.2725140Z remote: Compressing objects: 10% (114/1140) 2024-06-26T06:58:46.2824831Z remote: Compressing objects: 11% (126/1140) 2024-06-26T06:58:46.2895359Z remote: Compressing objects: 12% (137/1140) 2024-06-26T06:58:46.2932989Z remote: Compressing objects: 13% (149/1140) 2024-06-26T06:58:46.2933627Z remote: Compressing objects: 14% (160/1140) 2024-06-26T06:58:46.2994177Z remote: Compressing objects: 15% (171/1140) 2024-06-26T06:58:46.2994962Z remote: Compressing objects: 16% (183/1140) 2024-06-26T06:58:46.2997381Z remote: Compressing objects: 17% (194/1140) 2024-06-26T06:58:46.3023592Z remote: Compressing objects: 18% (206/1140) 2024-06-26T06:58:46.3046016Z remote: Compressing objects: 19% (217/1140) 2024-06-26T06:58:46.3065522Z remote: Compressing objects: 20% (228/1140) 2024-06-26T06:58:46.3088608Z remote: Compressing objects: 21% (240/1140) 2024-06-26T06:58:46.3105896Z remote: Compressing objects: 22% (251/1140) 2024-06-26T06:58:46.3115041Z remote: Compressing objects: 23% (263/1140) 2024-06-26T06:58:46.3122035Z remote: Compressing objects: 24% (274/1140) 2024-06-26T06:58:46.3131311Z remote: Compressing objects: 25% (285/1140) 2024-06-26T06:58:46.3136860Z remote: Compressing objects: 26% (297/1140) 2024-06-26T06:58:46.3141950Z remote: Compressing objects: 27% (308/1140) 2024-06-26T06:58:46.3152596Z remote: Compressing objects: 28% (320/1140) 2024-06-26T06:58:46.3165052Z remote: Compressing objects: 29% (331/1140) 2024-06-26T06:58:46.3175389Z remote: Compressing objects: 30% (342/1140) 2024-06-26T06:58:46.3176033Z remote: Compressing objects: 31% (354/1140) 2024-06-26T06:58:46.3203631Z remote: Compressing objects: 32% (365/1140) 2024-06-26T06:58:46.3238891Z remote: Compressing objects: 33% (377/1140) 2024-06-26T06:58:46.3239538Z remote: Compressing objects: 34% (388/1140) 2024-06-26T06:58:46.3240182Z remote: Compressing objects: 35% (399/1140) 2024-06-26T06:58:46.3261844Z remote: Compressing objects: 36% (411/1140) 2024-06-26T06:58:46.3278454Z remote: Compressing objects: 37% (422/1140) 2024-06-26T06:58:46.3290053Z remote: Compressing objects: 38% (434/1140) 2024-06-26T06:58:46.3301908Z remote: Compressing objects: 39% (445/1140) 2024-06-26T06:58:46.3327490Z remote: Compressing objects: 40% (456/1140) 2024-06-26T06:58:46.3349992Z remote: Compressing objects: 41% (468/1140) 2024-06-26T06:58:46.3368652Z remote: Compressing objects: 42% (479/1140) 2024-06-26T06:58:46.3377608Z remote: Compressing objects: 43% (491/1140) 2024-06-26T06:58:46.3383585Z remote: Compressing objects: 44% (502/1140) 2024-06-26T06:58:46.3388471Z remote: Compressing objects: 45% (513/1140) 2024-06-26T06:58:46.3398033Z remote: Compressing objects: 46% (525/1140) 2024-06-26T06:58:46.3414435Z remote: Compressing objects: 47% (536/1140) 2024-06-26T06:58:46.3422493Z remote: Compressing objects: 48% (548/1140) 2024-06-26T06:58:46.3425028Z remote: Compressing objects: 49% (559/1140) 2024-06-26T06:58:46.3430991Z remote: Compressing objects: 50% (570/1140) 2024-06-26T06:58:46.3448073Z remote: Compressing objects: 51% (582/1140) 2024-06-26T06:58:46.3465696Z remote: Compressing objects: 52% (593/1140) 2024-06-26T06:58:46.3478923Z remote: Compressing objects: 53% (605/1140) 2024-06-26T06:58:46.3499407Z remote: Compressing objects: 54% (616/1140) 2024-06-26T06:58:46.3512049Z remote: Compressing objects: 55% (627/1140) 2024-06-26T06:58:46.3520500Z remote: Compressing objects: 56% (639/1140) 2024-06-26T06:58:46.3528987Z remote: Compressing objects: 57% (650/1140) 2024-06-26T06:58:46.3539298Z remote: Compressing objects: 58% (662/1140) 2024-06-26T06:58:46.3551012Z remote: Compressing objects: 59% (673/1140) 2024-06-26T06:58:46.3562663Z remote: Compressing objects: 60% (684/1140) 2024-06-26T06:58:46.3572966Z remote: Compressing objects: 61% (696/1140) 2024-06-26T06:58:46.3584061Z remote: Compressing objects: 62% (707/1140) 2024-06-26T06:58:46.3594203Z remote: Compressing objects: 63% (719/1140) 2024-06-26T06:58:46.3604350Z remote: Compressing objects: 64% (730/1140) 2024-06-26T06:58:46.3612765Z remote: Compressing objects: 65% (741/1140) 2024-06-26T06:58:46.3630680Z remote: Compressing objects: 66% (753/1140) 2024-06-26T06:58:46.3633057Z remote: Compressing objects: 67% (764/1140) 2024-06-26T06:58:46.3637096Z remote: Compressing objects: 68% (776/1140) 2024-06-26T06:58:46.3643336Z remote: Compressing objects: 69% (787/1140) 2024-06-26T06:58:46.3648202Z remote: Compressing objects: 70% (798/1140) 2024-06-26T06:58:46.3653433Z remote: Compressing objects: 71% (810/1140) 2024-06-26T06:58:46.3654085Z remote: Compressing objects: 72% (821/1140) 2024-06-26T06:58:46.3656665Z remote: Compressing objects: 73% (833/1140) 2024-06-26T06:58:46.3657673Z remote: Compressing objects: 74% (844/1140) 2024-06-26T06:58:46.3658309Z remote: Compressing objects: 75% (855/1140) 2024-06-26T06:58:46.3660178Z remote: Compressing objects: 76% (867/1140) 2024-06-26T06:58:46.3660921Z remote: Compressing objects: 77% (878/1140) 2024-06-26T06:58:46.3661570Z remote: Compressing objects: 78% (890/1140) 2024-06-26T06:58:46.3662208Z remote: Compressing objects: 79% (901/1140) 2024-06-26T06:58:46.3664568Z remote: Compressing objects: 80% (912/1140) 2024-06-26T06:58:46.3666712Z remote: Compressing objects: 81% (924/1140) 2024-06-26T06:58:46.3667731Z remote: Compressing objects: 82% (935/1140) 2024-06-26T06:58:46.3669870Z remote: Compressing objects: 83% (947/1140) 2024-06-26T06:58:46.3670905Z remote: Compressing objects: 84% (958/1140) 2024-06-26T06:58:46.3672581Z remote: Compressing objects: 85% (969/1140) 2024-06-26T06:58:46.3673826Z remote: Compressing objects: 86% (981/1140) 2024-06-26T06:58:46.3675427Z remote: Compressing objects: 87% (992/1140) 2024-06-26T06:58:46.3677952Z remote: Compressing objects: 88% (1004/1140) 2024-06-26T06:58:46.3679501Z remote: Compressing objects: 89% (1015/1140) 2024-06-26T06:58:46.3681335Z remote: Compressing objects: 90% (1026/1140) 2024-06-26T06:58:46.3682647Z remote: Compressing objects: 91% (1038/1140) 2024-06-26T06:58:46.3684489Z remote: Compressing objects: 92% (1049/1140) 2024-06-26T06:58:46.3685988Z remote: Compressing objects: 93% (1061/1140) 2024-06-26T06:58:46.3689502Z remote: Compressing objects: 94% (1072/1140) 2024-06-26T06:58:46.3694236Z remote: Compressing objects: 95% (1083/1140) 2024-06-26T06:58:46.3694895Z remote: Compressing objects: 96% (1095/1140) 2024-06-26T06:58:46.3699504Z remote: Compressing objects: 97% (1106/1140) 2024-06-26T06:58:46.3700402Z remote: Compressing objects: 98% (1118/1140) 2024-06-26T06:58:46.3701299Z remote: Compressing objects: 99% (1129/1140) 2024-06-26T06:58:46.3702224Z remote: Compressing objects: 100% (1140/1140) 2024-06-26T06:58:46.3703053Z remote: Compressing objects: 100% (1140/1140), done. 2024-06-26T06:59:05.6911227Z remote: Total 990209 (delta 2280), reused 2016 (delta 1676), pack-reused 987392 2024-06-26T06:59:30.8469517Z [command]/usr/bin/git rev-parse --verify --quiet b8c4c54d347aa776934c60784e35936878ef18dc^{object} 2024-06-26T06:59:30.8491792Z b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T06:59:30.8496254Z ##[endgroup] 2024-06-26T06:59:30.8496814Z ##[group]Determining the checkout info 2024-06-26T06:59:30.8498853Z ##[endgroup] 2024-06-26T06:59:30.8499402Z ##[group]Checking out the ref 2024-06-26T06:59:30.8500583Z [command]/usr/bin/git checkout --quiet --force b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T06:59:32.0737942Z ##[endgroup] 2024-06-26T06:59:32.0739416Z ##[group]Setting up auth for fetching submodules 2024-06-26T06:59:32.0742070Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2024-06-26T06:59:32.0786378Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2024-06-26T06:59:32.0806557Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2024-06-26T06:59:32.0826012Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2024-06-26T06:59:32.0849416Z ##[endgroup] 2024-06-26T06:59:32.0850224Z ##[group]Fetching submodules 2024-06-26T06:59:32.0853507Z [command]/usr/bin/git submodule sync --recursive 2024-06-26T06:59:32.1060742Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2024-06-26T06:59:32.1273810Z Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni' 2024-06-26T06:59:32.1275377Z Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16' 2024-06-26T06:59:32.1276956Z Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv' 2024-06-26T06:59:32.1278760Z Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git) registered for path 'third_party/NNPACK' 2024-06-26T06:59:32.1280630Z Submodule 'third_party/VulkanMemoryAllocator' (https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git) registered for path 'third_party/VulkanMemoryAllocator' 2024-06-26T06:59:32.1282423Z Submodule 'third_party/XNNPACK' (https://github.com/google/XNNPACK.git) registered for path 'third_party/XNNPACK' 2024-06-26T06:59:32.1284114Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark' 2024-06-26T06:59:32.1286912Z Submodule 'third_party/cpp-httplib' (https://github.com/yhirose/cpp-httplib.git) registered for path 'third_party/cpp-httplib' 2024-06-26T06:59:32.1288554Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo.git) registered for path 'third_party/cpuinfo' 2024-06-26T06:59:32.1291021Z Submodule 'third_party/cudnn_frontend' (https://github.com/NVIDIA/cudnn-frontend.git) registered for path 'third_party/cudnn_frontend' 2024-06-26T06:59:32.1293159Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/cutlass' 2024-06-26T06:59:32.1295519Z Submodule 'third_party/eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'third_party/eigen' 2024-06-26T06:59:32.1297903Z Submodule 'third_party/fbgemm' (https://github.com/pytorch/fbgemm) registered for path 'third_party/fbgemm' 2024-06-26T06:59:32.1300511Z Submodule 'third_party/flatbuffers' (https://github.com/google/flatbuffers.git) registered for path 'third_party/flatbuffers' 2024-06-26T06:59:32.1302922Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/fmt' 2024-06-26T06:59:32.1305671Z Submodule 'third_party/foxi' (https://github.com/houseroad/foxi.git) registered for path 'third_party/foxi' 2024-06-26T06:59:32.1308572Z Submodule 'third_party/gemmlowp/gemmlowp' (https://github.com/google/gemmlowp.git) registered for path 'third_party/gemmlowp/gemmlowp' 2024-06-26T06:59:32.1311121Z Submodule 'third_party/gloo' (https://github.com/facebookincubator/gloo) registered for path 'third_party/gloo' 2024-06-26T06:59:32.1314017Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest' 2024-06-26T06:59:32.1316719Z Submodule 'third_party/ideep' (https://github.com/intel/ideep) registered for path 'third_party/ideep' 2024-06-26T06:59:32.1319765Z Submodule 'third_party/ittapi' (https://github.com/intel/ittapi.git) registered for path 'third_party/ittapi' 2024-06-26T06:59:32.1322800Z Submodule 'third_party/kineto' (https://github.com/pytorch/kineto) registered for path 'third_party/kineto' 2024-06-26T06:59:32.1326524Z Submodule 'third_party/mimalloc' (https://github.com/microsoft/mimalloc.git) registered for path 'third_party/mimalloc' 2024-06-26T06:59:32.1329509Z Submodule 'third_party/nccl/nccl' (https://github.com/NVIDIA/nccl) registered for path 'third_party/nccl/nccl' 2024-06-26T06:59:32.1332788Z Submodule 'third_party/nlohmann' (https://github.com/nlohmann/json.git) registered for path 'third_party/nlohmann' 2024-06-26T06:59:32.1336045Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx' 2024-06-26T06:59:32.1339729Z Submodule 'third_party/opentelemetry-cpp' (https://github.com/open-telemetry/opentelemetry-cpp.git) registered for path 'third_party/opentelemetry-cpp' 2024-06-26T06:59:32.1342891Z Submodule 'third_party/pocketfft' (https://github.com/mreineck/pocketfft) registered for path 'third_party/pocketfft' 2024-06-26T06:59:32.1346489Z Submodule 'third_party/protobuf' (https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf' 2024-06-26T06:59:32.1350107Z Submodule 'third_party/NNPACK_deps/psimd' (https://github.com/Maratyszcza/psimd.git) registered for path 'third_party/psimd' 2024-06-26T06:59:32.1353954Z Submodule 'third_party/NNPACK_deps/pthreadpool' (https://github.com/Maratyszcza/pthreadpool.git) registered for path 'third_party/pthreadpool' 2024-06-26T06:59:32.1357378Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11' 2024-06-26T06:59:32.1361152Z Submodule 'third_party/python-peachpy' (https://github.com/malfet/PeachPy.git) registered for path 'third_party/python-peachpy' 2024-06-26T06:59:32.1365049Z Submodule 'third_party/sleef' (https://github.com/shibatch/sleef) registered for path 'third_party/sleef' 2024-06-26T06:59:32.1369211Z Submodule 'third_party/tensorpipe' (https://github.com/pytorch/tensorpipe.git) registered for path 'third_party/tensorpipe' 2024-06-26T06:59:32.1389202Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/android/libs/fbjni'... 2024-06-26T06:59:32.3807777Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FP16'... 2024-06-26T06:59:32.6107171Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FXdiv'... 2024-06-26T06:59:32.7574196Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NNPACK'... 2024-06-26T06:59:32.9773917Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/VulkanMemoryAllocator'... 2024-06-26T06:59:35.0288914Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'... 2024-06-26T06:59:44.5885721Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/benchmark'... 2024-06-26T06:59:44.9511821Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpp-httplib'... 2024-06-26T06:59:45.4514824Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpuinfo'... 2024-06-26T06:59:46.0291502Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cudnn_frontend'... 2024-06-26T06:59:47.2101163Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'... 2024-06-26T06:59:48.8597563Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/eigen'... 2024-06-26T06:59:54.7654349Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'... 2024-06-26T06:59:56.0623777Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'... 2024-06-26T06:59:57.6910294Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fmt'... 2024-06-26T06:59:58.8700676Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/foxi'... 2024-06-26T06:59:59.0315971Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gemmlowp/gemmlowp'... 2024-06-26T06:59:59.3880902Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gloo'... 2024-06-26T06:59:59.6873999Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/googletest'... 2024-06-26T07:00:00.6466331Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep'... 2024-06-26T07:00:00.9710205Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ittapi'... 2024-06-26T07:00:01.2079978Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'... 2024-06-26T07:00:02.7441041Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/mimalloc'... 2024-06-26T07:00:03.5159033Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nccl/nccl'... 2024-06-26T07:00:04.2283697Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'... 2024-06-26T07:00:09.7985519Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'... 2024-06-26T07:00:12.1782005Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp'... 2024-06-26T07:00:17.7478702Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pocketfft'... 2024-06-26T07:00:17.9574059Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'... 2024-06-26T07:00:26.0882250Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/psimd'... 2024-06-26T07:00:26.2499733Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pthreadpool'... 2024-06-26T07:00:26.4428660Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'... 2024-06-26T07:00:27.4857509Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'... 2024-06-26T07:00:27.7354214Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'... 2024-06-26T07:00:28.3546401Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'... 2024-06-26T07:00:28.7406651Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2024-06-26T07:00:28.7492185Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2024-06-26T07:00:28.7552177Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2024-06-26T07:00:28.7731151Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2024-06-26T07:00:28.8037785Z Submodule path 'third_party/VulkanMemoryAllocator': checked out 'a6bfc237255a6bac1513f7c1ebde6d8aed6b5191' 2024-06-26T07:00:29.5756187Z Submodule path 'third_party/XNNPACK': checked out 'fcbf55af6cf28a4627bcd1f703ab7ad843f0f3a2' 2024-06-26T07:00:29.5936531Z Submodule path 'third_party/benchmark': checked out '0d98dba29d66e93259db7daa53a9327df767a415' 2024-06-26T07:00:29.6291218Z Submodule path 'third_party/cpp-httplib': checked out '3b6597bba913d51161383657829b7e644e59c006' 2024-06-26T07:00:29.7138815Z Submodule path 'third_party/cpuinfo': checked out '3c8b1533ac03dd6531ab6e7b9245d488f13a82a5' 2024-06-26T07:00:29.7406845Z Submodule path 'third_party/cudnn_frontend': checked out 'aa3abd4bc689d6412979c7f55f9cd132848c9c6a' 2024-06-26T07:00:30.1699728Z Submodule path 'third_party/cutlass': checked out 'bbe579a9e3beb6ea6626d9227ec32d0dae119a49' 2024-06-26T07:00:30.3826359Z Submodule path 'third_party/eigen': checked out '3147391d946bb4b6c68edd901f2add6ac1f31f8c' 2024-06-26T07:00:30.4463285Z Submodule path 'third_party/fbgemm': checked out 'dbc3157bf256f1339b3fa1fef2be89ac4078be0e' 2024-06-26T07:00:30.4474658Z Submodule 'third_party/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:00:30.4478164Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:00:30.4479800Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:00:30.4481446Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:00:30.4483298Z Submodule 'third_party/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:00:30.4500166Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/asmjit'... 2024-06-26T07:00:31.3251056Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/cpuinfo'... 2024-06-26T07:00:31.9107510Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/cutlass'... 2024-06-26T07:00:33.6497403Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/googletest'... 2024-06-26T07:00:34.6194462Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/third_party/hipify_torch'... 2024-06-26T07:00:34.9356113Z Submodule path 'third_party/fbgemm/third_party/asmjit': checked out 'd3fbf7c9bc7c1d1365a94a45614b91c5a3706b81' 2024-06-26T07:00:35.0195879Z Submodule path 'third_party/fbgemm/third_party/cpuinfo': checked out 'ed8b86a253800bafdb7b25c5c399f91bff9cb1f3' 2024-06-26T07:00:35.3602694Z Submodule path 'third_party/fbgemm/third_party/cutlass': checked out 'fc9ebc645b63f3a6bc80aaefde5c063fb72110d6' 2024-06-26T07:00:35.4150789Z Submodule path 'third_party/fbgemm/third_party/googletest': checked out 'cbf019de22c8dd37b2108da35b2748fd702d1796' 2024-06-26T07:00:35.4240949Z Submodule path 'third_party/fbgemm/third_party/hipify_torch': checked out '23f53b025b466d8ec3c45d52290d3442f7fbe6b1' 2024-06-26T07:00:35.5175160Z Submodule path 'third_party/flatbuffers': checked out '01834de25e4bf3975a9a00e816292b1ad0fe184b' 2024-06-26T07:00:35.5453183Z Submodule path 'third_party/fmt': checked out 'e69e5f977d458f2650bb346dadf2ad30c5320281' 2024-06-26T07:00:35.5517603Z Submodule path 'third_party/foxi': checked out 'c278588e34e535f0bb8f00df3880d26928038cad' 2024-06-26T07:00:35.5820257Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2024-06-26T07:00:35.6019447Z Submodule path 'third_party/gloo': checked out '5354032ea08eadd7fc4456477f7f7c6308818509' 2024-06-26T07:00:35.6383546Z Submodule path 'third_party/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2024-06-26T07:00:35.6479899Z Submodule path 'third_party/ideep': checked out '55ca0191687aaf19aca5cdb7881c791e3bea442b' 2024-06-26T07:00:35.6489215Z Submodule 'mkl-dnn' (https://github.com/intel/mkl-dnn.git) registered for path 'third_party/ideep/mkl-dnn' 2024-06-26T07:00:35.6505548Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep/mkl-dnn'... 2024-06-26T07:00:47.4531478Z Submodule path 'third_party/ideep/mkl-dnn': checked out '1137e04ec0b5251ca2b4400a4fd3c667ce843d67' 2024-06-26T07:00:47.4665848Z Submodule path 'third_party/ittapi': checked out '5b8a7d7422611c3a0d799fb5fc5dd4abfae35b42' 2024-06-26T07:00:47.5444156Z Submodule path 'third_party/kineto': checked out '8681ff11e1fa54da39023076c5c43eddd87b7a8a' 2024-06-26T07:00:47.5455019Z Submodule 'libkineto/third_party/dynolog' (https://github.com/facebookincubator/dynolog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:00:47.5456941Z Submodule 'libkineto/third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:00:47.5458825Z Submodule 'libkineto/third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:00:47.5477120Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog'... 2024-06-26T07:00:48.0485290Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/fmt'... 2024-06-26T07:00:49.2590017Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/googletest'... 2024-06-26T07:00:50.3062529Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out '7d04a0053a845370ae06ce317a22a48e9edcc74e' 2024-06-26T07:00:50.3072561Z Submodule 'third_party/DCGM' (https://github.com/NVIDIA/DCGM.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:00:50.3074679Z Submodule 'third_party/cpr' (https://github.com/libcpr/cpr.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:00:50.3076700Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:00:50.3078758Z Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:00:50.3080803Z Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:00:50.3083087Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:00:50.3085466Z Submodule 'third_party/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:00:50.3087387Z Submodule 'third_party/pfs' (https://github.com/dtrugman/pfs.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:00:50.3105052Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'... 2024-06-26T07:00:51.3576396Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'... 2024-06-26T07:00:51.6934127Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'... 2024-06-26T07:00:52.9749750Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'... 2024-06-26T07:00:53.2401528Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/glog'... 2024-06-26T07:00:53.6845234Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'... 2024-06-26T07:00:54.7044431Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/json'... 2024-06-26T07:01:00.2868002Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'... 2024-06-26T07:01:00.6328910Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2024-06-26T07:01:00.6467804Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2024-06-26T07:01:00.6761265Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2024-06-26T07:01:00.6857372Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2024-06-26T07:01:00.6867213Z Submodule 'doc' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:00.6883686Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'... 2024-06-26T07:01:00.9942835Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2024-06-26T07:01:01.0078215Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2024-06-26T07:01:01.0413505Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '58d77fa8070e8cec2dc1ed015d66b454c8d78850' 2024-06-26T07:01:01.1224762Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2024-06-26T07:01:01.1347075Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2024-06-26T07:01:01.1651477Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out 'a33701196adfad74917046096bf5a2aa0ab0bb50' 2024-06-26T07:01:01.2145832Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '7aca84427f224eeed3144123d5230d5871e93347' 2024-06-26T07:01:01.2448709Z Submodule path 'third_party/mimalloc': checked out 'b66e3214d8a104669c2ec05ae91ebc26a8f5ab78' 2024-06-26T07:01:01.2643662Z Submodule path 'third_party/nccl/nccl': checked out '48bb7fec7953112ff37499a272317f6663f8f600' 2024-06-26T07:01:01.3505194Z Submodule path 'third_party/nlohmann': checked out '87cda1d6646592ac5866dc703c8e1839046a6806' 2024-06-26T07:01:01.6217415Z Submodule path 'third_party/onnx': checked out '990217f043af7222348ca8f0301e17fa7b841781' 2024-06-26T07:01:01.6243413Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:01.6245382Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:01.6264505Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/benchmark'... 2024-06-26T07:01:02.0204053Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/pybind11'... 2024-06-26T07:01:03.0108003Z Submodule path 'third_party/onnx/third_party/benchmark': checked out '2dd015dfef425c866d9a43f2c67d8b52d709acb6' 2024-06-26T07:01:03.0384299Z Submodule path 'third_party/onnx/third_party/pybind11': checked out '5b0a6fc2017fcc176545afe3e09c9f9885283242' 2024-06-26T07:01:03.0914345Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2024-06-26T07:01:03.0928322Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark) registered for path 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:03.0930581Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:03.0932636Z Submodule 'third_party/ms-gsl' (https://github.com/microsoft/GSL) registered for path 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:03.0964248Z Submodule 'third_party/nlohmann-json' (https://github.com/nlohmann/json) registered for path 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:03.0966968Z Submodule 'third_party/opentelemetry-proto' (https://github.com/open-telemetry/opentelemetry-proto) registered for path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:03.0969419Z Submodule 'third_party/opentracing-cpp' (https://github.com/opentracing/opentracing-cpp.git) registered for path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:03.0971462Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:03.0973189Z Submodule 'tools/vcpkg' (https://github.com/Microsoft/vcpkg) registered for path 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:03.0974679Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/benchmark'... 2024-06-26T07:01:03.5229813Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/googletest'... 2024-06-26T07:01:04.6333042Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/ms-gsl'... 2024-06-26T07:01:04.9974581Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/nlohmann-json'... 2024-06-26T07:01:10.5572210Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentelemetry-proto'... 2024-06-26T07:01:10.8463900Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentracing-cpp'... 2024-06-26T07:01:12.1700056Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp'... 2024-06-26T07:01:12.4403513Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/tools/vcpkg'... 2024-06-26T07:01:19.8619192Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2024-06-26T07:01:19.8950117Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2024-06-26T07:01:19.9074637Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2024-06-26T07:01:19.9918190Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2024-06-26T07:01:20.0012047Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2024-06-26T07:01:20.0119825Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2024-06-26T07:01:20.0229528Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2024-06-26T07:01:20.0242518Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:20.0244495Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:20.0261397Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'... 2024-06-26T07:01:21.8741697Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'... 2024-06-26T07:01:23.0697041Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2024-06-26T07:01:23.1059044Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2024-06-26T07:01:23.4612114Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2024-06-26T07:01:23.4689051Z Submodule path 'third_party/pocketfft': checked out '9d3ab05a7fffbc71a492bc6a17be034e83e8f0fe' 2024-06-26T07:01:23.6869717Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2024-06-26T07:01:23.6885048Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:23.6886905Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:23.6904768Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/benchmark'... 2024-06-26T07:01:24.0939485Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/googletest'... 2024-06-26T07:01:25.0653613Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2024-06-26T07:01:25.1272037Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2024-06-26T07:01:25.1334201Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2024-06-26T07:01:25.1425219Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2024-06-26T07:01:25.1707377Z Submodule path 'third_party/pybind11': checked out '3e9dfa2866941655c56877882565e7577de6fc7b' 2024-06-26T07:01:25.1929554Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2024-06-26T07:01:25.2256615Z Submodule path 'third_party/sleef': checked out '60e76d2bce17d278b439d9da17177c8f957a9e9b' 2024-06-26T07:01:25.2455540Z Submodule path 'third_party/tensorpipe': checked out '52791a2fd214b2a9dc5759d36725909c1daa7f2e' 2024-06-26T07:01:25.2467305Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:25.2469344Z Submodule 'third_party/libnop' (https://github.com/google/libnop.git) registered for path 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:25.2471173Z Submodule 'third_party/libuv' (https://github.com/libuv/libuv.git) registered for path 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:25.2473672Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:25.2489719Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/googletest'... 2024-06-26T07:01:26.3045092Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libnop'... 2024-06-26T07:01:26.5157318Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libuv'... 2024-06-26T07:01:27.5580258Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11'... 2024-06-26T07:01:28.6206354Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2024-06-26T07:01:28.6318429Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2024-06-26T07:01:28.6877682Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '1dff88e5161cba5c59276d2070d2e304e4dcb242' 2024-06-26T07:01:28.7105523Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2024-06-26T07:01:28.7115629Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:28.7131998Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11/tools/clang'... 2024-06-26T07:01:28.9309018Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2024-06-26T07:01:28.9331696Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2024-06-26T07:01:28.9533586Z Entering 'android/libs/fbjni' 2024-06-26T07:01:28.9559159Z Entering 'third_party/FP16' 2024-06-26T07:01:28.9584678Z Entering 'third_party/FXdiv' 2024-06-26T07:01:28.9610908Z Entering 'third_party/NNPACK' 2024-06-26T07:01:28.9637918Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T07:01:28.9664536Z Entering 'third_party/XNNPACK' 2024-06-26T07:01:28.9702697Z Entering 'third_party/benchmark' 2024-06-26T07:01:28.9729124Z Entering 'third_party/cpp-httplib' 2024-06-26T07:01:28.9761623Z Entering 'third_party/cpuinfo' 2024-06-26T07:01:28.9788006Z Entering 'third_party/cudnn_frontend' 2024-06-26T07:01:28.9816499Z Entering 'third_party/cutlass' 2024-06-26T07:01:28.9850567Z Entering 'third_party/eigen' 2024-06-26T07:01:28.9877630Z Entering 'third_party/fbgemm' 2024-06-26T07:01:28.9904165Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:01:28.9932319Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:01:28.9956786Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:01:28.9988171Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:01:29.0017497Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:01:29.0044220Z Entering 'third_party/flatbuffers' 2024-06-26T07:01:29.0075454Z Entering 'third_party/fmt' 2024-06-26T07:01:29.0101904Z Entering 'third_party/foxi' 2024-06-26T07:01:29.0128777Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T07:01:29.0157764Z Entering 'third_party/gloo' 2024-06-26T07:01:29.0185164Z Entering 'third_party/googletest' 2024-06-26T07:01:29.0215228Z Entering 'third_party/ideep' 2024-06-26T07:01:29.0240779Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T07:01:29.0274268Z Entering 'third_party/ittapi' 2024-06-26T07:01:29.0302181Z Entering 'third_party/kineto' 2024-06-26T07:01:29.0329339Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:01:29.0356633Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:01:29.0382982Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:01:29.0411558Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:01:29.0438923Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:01:29.0463987Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:29.0491957Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:01:29.0518217Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:01:29.0545346Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:01:29.0574061Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:01:29.0602860Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:01:29.0632128Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:01:29.0659458Z Entering 'third_party/mimalloc' 2024-06-26T07:01:29.0686187Z Entering 'third_party/nccl/nccl' 2024-06-26T07:01:29.0713653Z Entering 'third_party/nlohmann' 2024-06-26T07:01:29.0739906Z Entering 'third_party/onnx' 2024-06-26T07:01:29.0777633Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:29.0806315Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:29.0836770Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T07:01:29.0865081Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:29.0893172Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:29.0919795Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:29.0946703Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:29.0973108Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:29.1000873Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:29.1027101Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:29.1052856Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:29.1081121Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:29.1109934Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:29.1152528Z Entering 'third_party/pocketfft' 2024-06-26T07:01:29.1182844Z Entering 'third_party/protobuf' 2024-06-26T07:01:29.1211949Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:29.1238971Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:29.1268300Z Entering 'third_party/psimd' 2024-06-26T07:01:29.1294851Z Entering 'third_party/pthreadpool' 2024-06-26T07:01:29.1323305Z Entering 'third_party/pybind11' 2024-06-26T07:01:29.1351483Z Entering 'third_party/python-peachpy' 2024-06-26T07:01:29.1380902Z Entering 'third_party/sleef' 2024-06-26T07:01:29.1407112Z Entering 'third_party/tensorpipe' 2024-06-26T07:01:29.1435093Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:29.1461415Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:29.1489016Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:29.1515311Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:29.1541204Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:29.1578284Z ##[endgroup] 2024-06-26T07:01:29.1580225Z ##[group]Persisting credentials for submodules 2024-06-26T07:01:29.1583393Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2024-06-26T07:01:29.1790078Z Entering 'android/libs/fbjni' 2024-06-26T07:01:29.1827006Z Entering 'third_party/FP16' 2024-06-26T07:01:29.1862237Z Entering 'third_party/FXdiv' 2024-06-26T07:01:29.1894390Z Entering 'third_party/NNPACK' 2024-06-26T07:01:29.1928412Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T07:01:29.1965102Z Entering 'third_party/XNNPACK' 2024-06-26T07:01:29.2009782Z Entering 'third_party/benchmark' 2024-06-26T07:01:29.2043885Z Entering 'third_party/cpp-httplib' 2024-06-26T07:01:29.2078996Z Entering 'third_party/cpuinfo' 2024-06-26T07:01:29.2114897Z Entering 'third_party/cudnn_frontend' 2024-06-26T07:01:29.2150846Z Entering 'third_party/cutlass' 2024-06-26T07:01:29.2187899Z Entering 'third_party/eigen' 2024-06-26T07:01:29.2222032Z Entering 'third_party/fbgemm' 2024-06-26T07:01:29.2259271Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:01:29.2292871Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:01:29.2327557Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:01:29.2365957Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:01:29.2398898Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:01:29.2434561Z Entering 'third_party/flatbuffers' 2024-06-26T07:01:29.2473165Z Entering 'third_party/fmt' 2024-06-26T07:01:29.2508959Z Entering 'third_party/foxi' 2024-06-26T07:01:29.2544575Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T07:01:29.2579868Z Entering 'third_party/gloo' 2024-06-26T07:01:29.2612098Z Entering 'third_party/googletest' 2024-06-26T07:01:29.2649498Z Entering 'third_party/ideep' 2024-06-26T07:01:29.2680962Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T07:01:29.2721256Z Entering 'third_party/ittapi' 2024-06-26T07:01:29.2755513Z Entering 'third_party/kineto' 2024-06-26T07:01:29.2790678Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:01:29.2826904Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:01:29.2860802Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:01:29.2894867Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:01:29.2928707Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:01:29.2961742Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:29.2997206Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:01:29.3032230Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:01:29.3065854Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:01:29.3101510Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:01:29.3136874Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:01:29.3169677Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:01:29.3205664Z Entering 'third_party/mimalloc' 2024-06-26T07:01:29.3243091Z Entering 'third_party/nccl/nccl' 2024-06-26T07:01:29.3277909Z Entering 'third_party/nlohmann' 2024-06-26T07:01:29.3314281Z Entering 'third_party/onnx' 2024-06-26T07:01:29.3362230Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:29.3397432Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:29.3435150Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T07:01:29.3474292Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:29.3507588Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:29.3541198Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:29.3575681Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:29.3610687Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:29.3644362Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:29.3677533Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:29.3711286Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:29.3747096Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:29.3781215Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:29.3830219Z Entering 'third_party/pocketfft' 2024-06-26T07:01:29.3864582Z Entering 'third_party/protobuf' 2024-06-26T07:01:29.3901763Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:29.3936094Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:29.3970739Z Entering 'third_party/psimd' 2024-06-26T07:01:29.4006286Z Entering 'third_party/pthreadpool' 2024-06-26T07:01:29.4042071Z Entering 'third_party/pybind11' 2024-06-26T07:01:29.4074741Z Entering 'third_party/python-peachpy' 2024-06-26T07:01:29.4109657Z Entering 'third_party/sleef' 2024-06-26T07:01:29.4144073Z Entering 'third_party/tensorpipe' 2024-06-26T07:01:29.4179784Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:29.4214437Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:29.4247116Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:29.4280645Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:29.4310725Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:29.4359349Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2024-06-26T07:01:29.4566970Z Entering 'android/libs/fbjni' 2024-06-26T07:01:29.4602848Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2024-06-26T07:01:29.4616035Z Entering 'third_party/FP16' 2024-06-26T07:01:29.4648465Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2024-06-26T07:01:29.4661083Z Entering 'third_party/FXdiv' 2024-06-26T07:01:29.4691828Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2024-06-26T07:01:29.4703448Z Entering 'third_party/NNPACK' 2024-06-26T07:01:29.4736047Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2024-06-26T07:01:29.4749892Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T07:01:29.4782539Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2024-06-26T07:01:29.4795938Z Entering 'third_party/XNNPACK' 2024-06-26T07:01:29.4827591Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2024-06-26T07:01:29.4852274Z Entering 'third_party/benchmark' 2024-06-26T07:01:29.4884549Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2024-06-26T07:01:29.4897493Z Entering 'third_party/cpp-httplib' 2024-06-26T07:01:29.4929915Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2024-06-26T07:01:29.4943837Z Entering 'third_party/cpuinfo' 2024-06-26T07:01:29.4975863Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2024-06-26T07:01:29.4989781Z Entering 'third_party/cudnn_frontend' 2024-06-26T07:01:29.5020773Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2024-06-26T07:01:29.5031618Z Entering 'third_party/cutlass' 2024-06-26T07:01:29.5063129Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2024-06-26T07:01:29.5080916Z Entering 'third_party/eigen' 2024-06-26T07:01:29.5113293Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/eigen/config remote.origin.url 2024-06-26T07:01:29.5126564Z Entering 'third_party/fbgemm' 2024-06-26T07:01:29.5163722Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2024-06-26T07:01:29.5178139Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:01:29.5209451Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/asmjit/config remote.origin.url 2024-06-26T07:01:29.5221893Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:01:29.5253616Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/cpuinfo/config remote.origin.url 2024-06-26T07:01:29.5266293Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:01:29.5300270Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/cutlass/config remote.origin.url 2024-06-26T07:01:29.5316784Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:01:29.5349222Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.5361424Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:01:29.5393195Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/third_party/hipify_torch/config remote.origin.url 2024-06-26T07:01:29.5406320Z Entering 'third_party/flatbuffers' 2024-06-26T07:01:29.5439684Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2024-06-26T07:01:29.5454686Z Entering 'third_party/fmt' 2024-06-26T07:01:29.5484404Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2024-06-26T07:01:29.5499801Z Entering 'third_party/foxi' 2024-06-26T07:01:29.5531013Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/foxi/config remote.origin.url 2024-06-26T07:01:29.5544484Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T07:01:29.5575862Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2024-06-26T07:01:29.5590020Z Entering 'third_party/gloo' 2024-06-26T07:01:29.5621334Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2024-06-26T07:01:29.5635295Z Entering 'third_party/googletest' 2024-06-26T07:01:29.5668006Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.5681109Z Entering 'third_party/ideep' 2024-06-26T07:01:29.5713700Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2024-06-26T07:01:29.5725662Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T07:01:29.5757876Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2024-06-26T07:01:29.5776208Z Entering 'third_party/ittapi' 2024-06-26T07:01:29.5808306Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2024-06-26T07:01:29.5821124Z Entering 'third_party/kineto' 2024-06-26T07:01:29.5852791Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2024-06-26T07:01:29.5866341Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:01:29.5897826Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2024-06-26T07:01:29.5910319Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:01:29.5943505Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2024-06-26T07:01:29.5957225Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:01:29.5989449Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2024-06-26T07:01:29.6002277Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:01:29.6036106Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2024-06-26T07:01:29.6048376Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:01:29.6080746Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2024-06-26T07:01:29.6092468Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:29.6124429Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2024-06-26T07:01:29.6139068Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:01:29.6171236Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2024-06-26T07:01:29.6183753Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:01:29.6217242Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.6228407Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:01:29.6260636Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2024-06-26T07:01:29.6272585Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:01:29.6304495Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2024-06-26T07:01:29.6318235Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:01:29.6352036Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2024-06-26T07:01:29.6364461Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:01:29.6397212Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.6411644Z Entering 'third_party/mimalloc' 2024-06-26T07:01:29.6442706Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2024-06-26T07:01:29.6456087Z Entering 'third_party/nccl/nccl' 2024-06-26T07:01:29.6488628Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nccl/nccl/config remote.origin.url 2024-06-26T07:01:29.6501833Z Entering 'third_party/nlohmann' 2024-06-26T07:01:29.6532263Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2024-06-26T07:01:29.6544701Z Entering 'third_party/onnx' 2024-06-26T07:01:29.6576972Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2024-06-26T07:01:29.6602277Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:29.6634442Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/benchmark/config remote.origin.url 2024-06-26T07:01:29.6648228Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:29.6679717Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2024-06-26T07:01:29.6694648Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T07:01:29.6727688Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2024-06-26T07:01:29.6741483Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:29.6771624Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2024-06-26T07:01:29.6783294Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:29.6814431Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.6826117Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:29.6857082Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2024-06-26T07:01:29.6869502Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:29.6901345Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2024-06-26T07:01:29.6915092Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:29.6947806Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2024-06-26T07:01:29.6959568Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:29.6992434Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2024-06-26T07:01:29.7005053Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:29.7037689Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2024-06-26T07:01:29.7049664Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:29.7081318Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2024-06-26T07:01:29.7095573Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:29.7126820Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2024-06-26T07:01:29.7140254Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:29.7169626Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2024-06-26T07:01:29.7197135Z Entering 'third_party/pocketfft' 2024-06-26T07:01:29.7230194Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2024-06-26T07:01:29.7243018Z Entering 'third_party/protobuf' 2024-06-26T07:01:29.7274484Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2024-06-26T07:01:29.7288333Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:29.7319882Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2024-06-26T07:01:29.7332271Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:29.7364194Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.7378524Z Entering 'third_party/psimd' 2024-06-26T07:01:29.7410533Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2024-06-26T07:01:29.7422681Z Entering 'third_party/pthreadpool' 2024-06-26T07:01:29.7452541Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2024-06-26T07:01:29.7464384Z Entering 'third_party/pybind11' 2024-06-26T07:01:29.7494792Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2024-06-26T07:01:29.7508856Z Entering 'third_party/python-peachpy' 2024-06-26T07:01:29.7539946Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2024-06-26T07:01:29.7554012Z Entering 'third_party/sleef' 2024-06-26T07:01:29.7586978Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2024-06-26T07:01:29.7598735Z Entering 'third_party/tensorpipe' 2024-06-26T07:01:29.7629009Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2024-06-26T07:01:29.7642541Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:29.7673961Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2024-06-26T07:01:29.7687305Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:29.7718486Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2024-06-26T07:01:29.7730486Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:29.7761070Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2024-06-26T07:01:29.7773011Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:29.7804022Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2024-06-26T07:01:29.7816334Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:29.7847971Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2024-06-26T07:01:29.8663199Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2024-06-26T07:01:29.8872970Z Entering 'android/libs/fbjni' 2024-06-26T07:01:29.8900244Z Entering 'third_party/FP16' 2024-06-26T07:01:29.8928681Z Entering 'third_party/FXdiv' 2024-06-26T07:01:29.8954768Z Entering 'third_party/NNPACK' 2024-06-26T07:01:29.8983308Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T07:01:29.9010813Z Entering 'third_party/XNNPACK' 2024-06-26T07:01:29.9048809Z Entering 'third_party/benchmark' 2024-06-26T07:01:29.9078207Z Entering 'third_party/cpp-httplib' 2024-06-26T07:01:29.9106364Z Entering 'third_party/cpuinfo' 2024-06-26T07:01:29.9132783Z Entering 'third_party/cudnn_frontend' 2024-06-26T07:01:29.9161748Z Entering 'third_party/cutlass' 2024-06-26T07:01:29.9197513Z Entering 'third_party/eigen' 2024-06-26T07:01:29.9226612Z Entering 'third_party/fbgemm' 2024-06-26T07:01:29.9255242Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:01:29.9282041Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:01:29.9310827Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:01:29.9342760Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:01:29.9369442Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:01:29.9398603Z Entering 'third_party/flatbuffers' 2024-06-26T07:01:29.9428314Z Entering 'third_party/fmt' 2024-06-26T07:01:29.9454888Z Entering 'third_party/foxi' 2024-06-26T07:01:29.9482517Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T07:01:29.9513040Z Entering 'third_party/gloo' 2024-06-26T07:01:29.9539848Z Entering 'third_party/googletest' 2024-06-26T07:01:29.9566400Z Entering 'third_party/ideep' 2024-06-26T07:01:29.9594202Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T07:01:29.9626627Z Entering 'third_party/ittapi' 2024-06-26T07:01:29.9654710Z Entering 'third_party/kineto' 2024-06-26T07:01:29.9680929Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:01:29.9709199Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:01:29.9736110Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:01:29.9764840Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:01:29.9792233Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:01:29.9818799Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:29.9846758Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:01:29.9872646Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:01:29.9900593Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:01:29.9929021Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:01:29.9955172Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:01:29.9981363Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:01:30.0010467Z Entering 'third_party/mimalloc' 2024-06-26T07:01:30.0038259Z Entering 'third_party/nccl/nccl' 2024-06-26T07:01:30.0065164Z Entering 'third_party/nlohmann' 2024-06-26T07:01:30.0094411Z Entering 'third_party/onnx' 2024-06-26T07:01:30.0132787Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:30.0160360Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:30.0189321Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T07:01:30.0219547Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:30.0245374Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:30.0273118Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:30.0300050Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:30.0329517Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:30.0357659Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:30.0384968Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:30.0412020Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:30.0440178Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:30.0470281Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:30.0514521Z Entering 'third_party/pocketfft' 2024-06-26T07:01:30.0543758Z Entering 'third_party/protobuf' 2024-06-26T07:01:30.0573548Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:30.0601327Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:30.0630481Z Entering 'third_party/psimd' 2024-06-26T07:01:30.0658072Z Entering 'third_party/pthreadpool' 2024-06-26T07:01:30.0683580Z Entering 'third_party/pybind11' 2024-06-26T07:01:30.0711991Z Entering 'third_party/python-peachpy' 2024-06-26T07:01:30.0738464Z Entering 'third_party/sleef' 2024-06-26T07:01:30.0765015Z Entering 'third_party/tensorpipe' 2024-06-26T07:01:30.0793691Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:30.0822019Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:30.0850224Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:30.0876679Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:30.0903711Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:30.0943086Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2024-06-26T07:01:30.1147730Z Entering 'android/libs/fbjni' 2024-06-26T07:01:30.1177212Z Entering 'third_party/FP16' 2024-06-26T07:01:30.1203783Z Entering 'third_party/FXdiv' 2024-06-26T07:01:30.1232975Z Entering 'third_party/NNPACK' 2024-06-26T07:01:30.1259963Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T07:01:30.1288822Z Entering 'third_party/XNNPACK' 2024-06-26T07:01:30.1328055Z Entering 'third_party/benchmark' 2024-06-26T07:01:30.1355200Z Entering 'third_party/cpp-httplib' 2024-06-26T07:01:30.1384722Z Entering 'third_party/cpuinfo' 2024-06-26T07:01:30.1410792Z Entering 'third_party/cudnn_frontend' 2024-06-26T07:01:30.1437844Z Entering 'third_party/cutlass' 2024-06-26T07:01:30.1470270Z Entering 'third_party/eigen' 2024-06-26T07:01:30.1497795Z Entering 'third_party/fbgemm' 2024-06-26T07:01:30.1527578Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T07:01:30.1554420Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T07:01:30.1584196Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T07:01:30.1616140Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T07:01:30.1643800Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T07:01:30.1672935Z Entering 'third_party/flatbuffers' 2024-06-26T07:01:30.1701557Z Entering 'third_party/fmt' 2024-06-26T07:01:30.1729429Z Entering 'third_party/foxi' 2024-06-26T07:01:30.1757125Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T07:01:30.1785731Z Entering 'third_party/gloo' 2024-06-26T07:01:30.1814727Z Entering 'third_party/googletest' 2024-06-26T07:01:30.1841886Z Entering 'third_party/ideep' 2024-06-26T07:01:30.1870520Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T07:01:30.1901680Z Entering 'third_party/ittapi' 2024-06-26T07:01:30.1931023Z Entering 'third_party/kineto' 2024-06-26T07:01:30.1958709Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T07:01:30.1984960Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T07:01:30.2013241Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T07:01:30.2041513Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T07:01:30.2067281Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T07:01:30.2092305Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T07:01:30.2122360Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T07:01:30.2149075Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T07:01:30.2174834Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T07:01:30.2203016Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T07:01:30.2231478Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T07:01:30.2259137Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T07:01:30.2287506Z Entering 'third_party/mimalloc' 2024-06-26T07:01:30.2315000Z Entering 'third_party/nccl/nccl' 2024-06-26T07:01:30.2343797Z Entering 'third_party/nlohmann' 2024-06-26T07:01:30.2371173Z Entering 'third_party/onnx' 2024-06-26T07:01:30.2409542Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T07:01:30.2436269Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T07:01:30.2464143Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T07:01:30.2495243Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T07:01:30.2523261Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T07:01:30.2553241Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T07:01:30.2579880Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T07:01:30.2606988Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T07:01:30.2633818Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T07:01:30.2661058Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T07:01:30.2686904Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T07:01:30.2714810Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T07:01:30.2742784Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T07:01:30.2784997Z Entering 'third_party/pocketfft' 2024-06-26T07:01:30.2813631Z Entering 'third_party/protobuf' 2024-06-26T07:01:30.2843469Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T07:01:30.2872063Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T07:01:30.2900106Z Entering 'third_party/psimd' 2024-06-26T07:01:30.2925193Z Entering 'third_party/pthreadpool' 2024-06-26T07:01:30.2953272Z Entering 'third_party/pybind11' 2024-06-26T07:01:30.2980119Z Entering 'third_party/python-peachpy' 2024-06-26T07:01:30.3009703Z Entering 'third_party/sleef' 2024-06-26T07:01:30.3038360Z Entering 'third_party/tensorpipe' 2024-06-26T07:01:30.3067214Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T07:01:30.3092527Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T07:01:30.3119263Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T07:01:30.3147230Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T07:01:30.3172714Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T07:01:30.3211445Z ##[endgroup] 2024-06-26T07:01:30.3243818Z [command]/usr/bin/git log -1 --format='%H' 2024-06-26T07:01:30.3265184Z 'b8c4c54d347aa776934c60784e35936878ef18dc' 2024-06-26T07:01:30.3437821Z Prepare all required actions 2024-06-26T07:01:30.3438351Z Getting action download info 2024-06-26T07:01:30.4765711Z ##[group]Run ./.github/actions/setup-linux 2024-06-26T07:01:30.4766121Z env: 2024-06-26T07:01:30.4766392Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:30.4766737Z ##[endgroup] 2024-06-26T07:01:30.4826945Z ##[group]Run set -euo pipefail 2024-06-26T07:01:30.4827497Z set -euo pipefail 2024-06-26T07:01:30.4827887Z function get_ec2_metadata() { 2024-06-26T07:01:30.4828413Z  # Pulled from instance metadata endpoint for EC2 2024-06-26T07:01:30.4829310Z  # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 2024-06-26T07:01:30.4830078Z  category=$1 2024-06-26T07:01:30.4830599Z  # If it is GCP runner (runner name contains gcp), do not run this 2024-06-26T07:01:30.4831235Z  runner_name_str=i-089977a14ceff415f 2024-06-26T07:01:30.4831730Z  if [[ -f /.inarc ]]; then 2024-06-26T07:01:30.4832217Z  echo "ARC Runner, no info on ec2 metadata" 2024-06-26T07:01:30.4832785Z  elif [[ $runner_name_str == *"gcp"* ]]; then 2024-06-26T07:01:30.4833483Z  echo "Runner is from Google Cloud Platform, No info on ec2 metadata" 2024-06-26T07:01:30.4834080Z  else 2024-06-26T07:01:30.4834569Z  curl -fsSL "http://169.254.169.254/latest/meta-data/${category}" 2024-06-26T07:01:30.4835150Z  fi 2024-06-26T07:01:30.4835420Z } 2024-06-26T07:01:30.4835779Z echo "ami-id: $(get_ec2_metadata ami-id)" 2024-06-26T07:01:30.4836362Z echo "instance-id: $(get_ec2_metadata instance-id)" 2024-06-26T07:01:30.4837006Z echo "instance-type: $(get_ec2_metadata instance-type)" 2024-06-26T07:01:30.4837572Z echo "system info $(uname -a)" 2024-06-26T07:01:30.4845434Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:30.4845933Z env: 2024-06-26T07:01:30.4846209Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:30.4846549Z ##[endgroup] 2024-06-26T07:01:30.4913644Z ami-id: ami-0ce0c36d7a00b20e2 2024-06-26T07:01:30.4952752Z instance-id: i-089977a14ceff415f 2024-06-26T07:01:30.4998124Z instance-type: g5.4xlarge 2024-06-26T07:01:30.5003913Z system info Linux ip-10-0-39-17.ec2.internal 4.14.336-257.562.amzn2.x86_64 #1 SMP Sat Feb 24 09:50:35 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux 2024-06-26T07:01:30.5026022Z ##[group]Run echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT 2024-06-26T07:01:30.5027173Z echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT 2024-06-26T07:01:30.5034686Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:30.5035181Z env: 2024-06-26T07:01:30.5035448Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:30.5035795Z ##[endgroup] 2024-06-26T07:01:30.5097447Z ##[group]Run if systemctl is-active --quiet docker; then 2024-06-26T07:01:30.5098047Z if systemctl is-active --quiet docker; then 2024-06-26T07:01:30.5098579Z  echo "Docker daemon is running..."; 2024-06-26T07:01:30.5099023Z else 2024-06-26T07:01:30.5099499Z  echo "Starting docker deamon..." && sudo systemctl start docker; 2024-06-26T07:01:30.5100069Z fi 2024-06-26T07:01:30.5106910Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:30.5107417Z env: 2024-06-26T07:01:30.5107689Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:30.5108028Z ##[endgroup] 2024-06-26T07:01:30.5142014Z Docker daemon is running... 2024-06-26T07:01:30.5189726Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-06-26T07:01:30.5190304Z with: 2024-06-26T07:01:30.5190567Z shell: bash 2024-06-26T07:01:30.5190859Z timeout_minutes: 5 2024-06-26T07:01:30.5191182Z max_attempts: 3 2024-06-26T07:01:30.5191495Z retry_wait_seconds: 30 2024-06-26T07:01:30.5193019Z command: AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" 2024-06-26T07:01:30.5194483Z polling_interval_seconds: 1 2024-06-26T07:01:30.5194867Z warning_on_retry: true 2024-06-26T07:01:30.5195219Z continue_on_error: false 2024-06-26T07:01:30.5195551Z env: 2024-06-26T07:01:30.5195927Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:30.5196278Z AWS_RETRY_MODE: standard 2024-06-26T07:01:30.5196614Z AWS_MAX_ATTEMPTS: 5 2024-06-26T07:01:30.5196959Z AWS_DEFAULT_REGION: us-east-1 2024-06-26T07:01:30.5197326Z ##[endgroup] 2024-06-26T07:01:31.3581905Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2024-06-26T07:01:31.3582726Z Configure a credential helper to remove this warning. See 2024-06-26T07:01:31.3583933Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2024-06-26T07:01:31.3584532Z 2024-06-26T07:01:31.3584680Z Login Succeeded 2024-06-26T07:01:31.5687687Z Command completed after 1 attempt(s). 2024-06-26T07:01:31.5730712Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-06-26T07:01:31.5731426Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-06-26T07:01:31.5732072Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2024-06-26T07:01:31.5739816Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:31.5740313Z env: 2024-06-26T07:01:31.5740583Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.5740924Z ##[endgroup] 2024-06-26T07:01:31.5805671Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2024-06-26T07:01:31.5806454Z # ignore expansion of "docker ps -q" since it could be empty 2024-06-26T07:01:31.5807048Z # shellcheck disable=SC2046 2024-06-26T07:01:31.5807504Z docker stop $(docker ps -q) || true 2024-06-26T07:01:31.5807987Z # Prune all of the docker images 2024-06-26T07:01:31.5808436Z docker system prune -af 2024-06-26T07:01:31.5815610Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:31.5816109Z env: 2024-06-26T07:01:31.5816388Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.5816723Z ##[endgroup] 2024-06-26T07:01:31.6067537Z "docker stop" requires at least 1 argument. 2024-06-26T07:01:31.6068176Z See 'docker stop --help'. 2024-06-26T07:01:31.6068456Z 2024-06-26T07:01:31.6068678Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 2024-06-26T07:01:31.6070587Z 2024-06-26T07:01:31.6070982Z Stop one or more running containers 2024-06-26T07:01:31.6213988Z Total reclaimed space: 0B 2024-06-26T07:01:31.6254410Z ##[group]Run set +e 2024-06-26T07:01:31.6254903Z set +e 2024-06-26T07:01:31.6255329Z set -x 2024-06-26T07:01:31.6255764Z  2024-06-26T07:01:31.6256238Z PT_DOMAIN=download.pytorch.org 2024-06-26T07:01:31.6257430Z # TODO: Flaky access to download.pytorch.org https://github.com/pytorch/pytorch/issues/100400, 2024-06-26T07:01:31.6259157Z # cleaning this up once the issue is fixed. There are more than one resolved IP here, the last 2024-06-26T07:01:31.6260333Z # one is returned at random 2024-06-26T07:01:31.6261133Z RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" | tail -n1) 2024-06-26T07:01:31.6261930Z  2024-06-26T07:01:31.6262396Z if [ -z "${RESOLVED_IP}" ]; then 2024-06-26T07:01:31.6263302Z  echo "Couldn't resolve ${PT_DOMAIN}, retrying with Google DNS..." 2024-06-26T07:01:31.6264445Z  RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" @8.8.8.8 | tail -n1) 2024-06-26T07:01:31.6265299Z  2024-06-26T07:01:31.6265907Z  if [ -z "${RESOLVED_IP}" ]; then 2024-06-26T07:01:31.6266728Z  echo "Couldn't resolve ${PT_DOMAIN}, exiting..." 2024-06-26T07:01:31.6267484Z  exit 1 2024-06-26T07:01:31.6267934Z  fi 2024-06-26T07:01:31.6268335Z fi 2024-06-26T07:01:31.6268925Z  2024-06-26T07:01:31.6269457Z if grep -r "${PT_DOMAIN}" /etc/hosts; then 2024-06-26T07:01:31.6270220Z  # Clean up any old records first 2024-06-26T07:01:31.6270990Z  sudo sed -i "/${PT_DOMAIN}/d" /etc/hosts 2024-06-26T07:01:31.6271663Z fi 2024-06-26T07:01:31.6272065Z  2024-06-26T07:01:31.6272678Z echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts 2024-06-26T07:01:31.6273673Z cat /etc/hosts 2024-06-26T07:01:31.6283752Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:31.6284499Z env: 2024-06-26T07:01:31.6285068Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.6285563Z ##[endgroup] 2024-06-26T07:01:31.6309776Z + PT_DOMAIN=download.pytorch.org 2024-06-26T07:01:31.6312650Z ++ dig -4 +short download.pytorch.org 2024-06-26T07:01:31.6313473Z ++ tail -n1 2024-06-26T07:01:31.6574475Z + RESOLVED_IP=18.160.10.76 2024-06-26T07:01:31.6574941Z + '[' -z 18.160.10.76 ']' 2024-06-26T07:01:31.6575436Z + grep -r download.pytorch.org /etc/hosts 2024-06-26T07:01:31.6580744Z + sudo sed -i /download.pytorch.org/d /etc/hosts 2024-06-26T07:01:31.6581297Z 18.160.10.36 download.pytorch.org 2024-06-26T07:01:31.6680321Z + echo '18.160.10.76 download.pytorch.org' 2024-06-26T07:01:31.6680944Z + sudo tee -a /etc/hosts 2024-06-26T07:01:31.6918505Z 18.160.10.76 download.pytorch.org 2024-06-26T07:01:31.6932970Z + cat /etc/hosts 2024-06-26T07:01:31.6937140Z 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 2024-06-26T07:01:31.6948997Z ::1 localhost6 localhost6.localdomain6 2024-06-26T07:01:31.6949500Z 18.160.10.76 download.pytorch.org 2024-06-26T07:01:31.7141778Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2024-06-26T07:01:31.7142387Z with: 2024-06-26T07:01:31.7143333Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7144420Z docker-build-dir: .ci/docker 2024-06-26T07:01:31.7144786Z working-directory: . 2024-06-26T07:01:31.7145242Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:31.7145846Z force-push: false 2024-06-26T07:01:31.7146136Z env: 2024-06-26T07:01:31.7146400Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.7146732Z ##[endgroup] 2024-06-26T07:01:31.7165942Z ##[group]Run set -ex 2024-06-26T07:01:31.7166298Z set -ex 2024-06-26T07:01:31.7166591Z  2024-06-26T07:01:31.7167194Z # If the docker build directory or the build script doesn't exist, the action will 2024-06-26T07:01:31.7168257Z # gracefully return the docker image name as it is. Pulling docker image in Linux 2024-06-26T07:01:31.7169052Z # job could then download the pre-built image as usual 2024-06-26T07:01:31.7169789Z if [[ ! -d "${DOCKER_BUILD_DIR}" ]] || [[ ! -f "${DOCKER_BUILD_DIR}/build.sh" ]]; then 2024-06-26T07:01:31.7170450Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7171078Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7171622Z  2024-06-26T07:01:31.7172128Z  echo "There is no Docker build script in ${REPO_NAME} repo, skipping..." 2024-06-26T07:01:31.7172751Z  exit 0 2024-06-26T07:01:31.7173042Z else 2024-06-26T07:01:31.7173414Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7173861Z fi 2024-06-26T07:01:31.7174130Z  2024-06-26T07:01:31.7174590Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2024-06-26T07:01:31.7175438Z  # The docker image name already includes the ECR prefix and tag, so we can just 2024-06-26T07:01:31.7176214Z  # use it as it is, but first let's extract the tag 2024-06-26T07:01:31.7176916Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2024-06-26T07:01:31.7177633Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7178324Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7178876Z else 2024-06-26T07:01:31.7179304Z  DOCKER_TAG=$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2024-06-26T07:01:31.7179960Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7180972Z  echo "docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7181719Z fi 2024-06-26T07:01:31.7189386Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:31.7189883Z env: 2024-06-26T07:01:31.7190157Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.7190509Z REPO_NAME: pytorch 2024-06-26T07:01:31.7191489Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7192549Z DOCKER_BUILD_DIR: .ci/docker 2024-06-26T07:01:31.7193048Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:31.7193558Z ##[endgroup] 2024-06-26T07:01:31.7214482Z + [[ ! -d .ci/docker ]] 2024-06-26T07:01:31.7215295Z + [[ ! -f .ci/docker/build.sh ]] 2024-06-26T07:01:31.7216040Z + echo skip=false 2024-06-26T07:01:31.7218891Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2024-06-26T07:01:31.7221458Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7222512Z ++ awk -F '[:,]' '{print $2}' 2024-06-26T07:01:31.7225978Z + DOCKER_TAG=91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7226656Z + echo docker-tag=91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7228436Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7255798Z ##[group]Run set +e 2024-06-26T07:01:31.7256153Z set +e 2024-06-26T07:01:31.7256450Z set -x 2024-06-26T07:01:31.7256730Z  2024-06-26T07:01:31.7257012Z login() { 2024-06-26T07:01:31.7257666Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2024-06-26T07:01:31.7258377Z } 2024-06-26T07:01:31.7258647Z  2024-06-26T07:01:31.7258938Z retry () { 2024-06-26T07:01:31.7259323Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2024-06-26T07:01:31.7259766Z } 2024-06-26T07:01:31.7260035Z  2024-06-26T07:01:31.7260342Z retry login "${DOCKER_REGISTRY}" 2024-06-26T07:01:31.7260756Z  2024-06-26T07:01:31.7261228Z # Check if image already exists, if it does then skip building it 2024-06-26T07:01:31.7261937Z if docker manifest inspect "${DOCKER_IMAGE}"; then 2024-06-26T07:01:31.7262441Z  exit 0 2024-06-26T07:01:31.7262733Z fi 2024-06-26T07:01:31.7263005Z  2024-06-26T07:01:31.7263494Z # NB: This part requires a full checkout. Otherwise, the merge base will 2024-06-26T07:01:31.7264333Z # be empty. The default action would be to continue rebuild the image 2024-06-26T07:01:31.7265069Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2024-06-26T07:01:31.7265834Z  # if we're on the base branch then use the parent commit 2024-06-26T07:01:31.7266434Z  MERGE_BASE=$(git rev-parse HEAD~) 2024-06-26T07:01:31.7266860Z else 2024-06-26T07:01:31.7267317Z  # otherwise we're on a PR, so use the most recent base commit 2024-06-26T07:01:31.7268061Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2024-06-26T07:01:31.7268567Z fi 2024-06-26T07:01:31.7268835Z  2024-06-26T07:01:31.7269146Z if [[ -z "${MERGE_BASE}" ]]; then 2024-06-26T07:01:31.7269648Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7270101Z  2024-06-26T07:01:31.7270747Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2024-06-26T07:01:31.7271616Z  exit 0 2024-06-26T07:01:31.7271911Z fi 2024-06-26T07:01:31.7272174Z  2024-06-26T07:01:31.7272612Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2024-06-26T07:01:31.7273601Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2024-06-26T07:01:31.7274396Z  exit 1 2024-06-26T07:01:31.7274690Z fi 2024-06-26T07:01:31.7274960Z  2024-06-26T07:01:31.7275432Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2024-06-26T07:01:31.7276386Z # If no image exists but the hash is the same as the previous hash then we should error out here 2024-06-26T07:01:31.7277259Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2024-06-26T07:01:31.7278240Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2024-06-26T07:01:31.7279353Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2024-06-26T07:01:31.7280022Z fi 2024-06-26T07:01:31.7280305Z  2024-06-26T07:01:31.7280732Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2024-06-26T07:01:31.7288302Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:31.7288800Z env: 2024-06-26T07:01:31.7289067Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:31.7289425Z DOCKER_BUILD_DIR: .ci/docker 2024-06-26T07:01:31.7289887Z BASE_REVISION: 4b9c9a9cc9c9283380a011310ba180c105c3dcb9 2024-06-26T07:01:31.7291009Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7292121Z DOCKER_TAG: 91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:31.7292713Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:31.7293228Z ##[endgroup] 2024-06-26T07:01:31.7312032Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:31.7312693Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:31.7313528Z + aws ecr get-login-password --region us-east-1 2024-06-26T07:01:31.7314684Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.1291241Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2024-06-26T07:01:32.1292262Z Configure a credential helper to remove this warning. See 2024-06-26T07:01:32.1293386Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2024-06-26T07:01:32.1294095Z 2024-06-26T07:01:32.1294275Z Login Succeeded 2024-06-26T07:01:32.1304608Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.3215068Z { 2024-06-26T07:01:32.3215477Z "schemaVersion": 2, 2024-06-26T07:01:32.3216375Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2024-06-26T07:01:32.3217113Z "config": { 2024-06-26T07:01:32.3217585Z "mediaType": "application/vnd.docker.container.image.v1+json", 2024-06-26T07:01:32.3218135Z "size": 48450, 2024-06-26T07:01:32.3218776Z "digest": "sha256:275441d577b6501c7c3f0a3f3b304d8adc9ce71705f2cbe57152c9a00c123a71" 2024-06-26T07:01:32.3219397Z }, 2024-06-26T07:01:32.3219661Z "layers": [ 2024-06-26T07:01:32.3219936Z { 2024-06-26T07:01:32.3220366Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3220916Z "size": 28580681, 2024-06-26T07:01:32.3221499Z "digest": "sha256:7a2c559011895d255fce249c00396abff5ae7e0c0a92931d0ed493e71de78e3a" 2024-06-26T07:01:32.3222317Z }, 2024-06-26T07:01:32.3222675Z { 2024-06-26T07:01:32.3223142Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3223703Z "size": 7943451, 2024-06-26T07:01:32.3224537Z "digest": "sha256:224fe954d7252f10539d243d6c9688806f7d13ad775ed02e7f7c79077844510d" 2024-06-26T07:01:32.3225353Z }, 2024-06-26T07:01:32.3225852Z { 2024-06-26T07:01:32.3226454Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3227173Z "size": 55728572, 2024-06-26T07:01:32.3227886Z "digest": "sha256:75722010b82e31715876aeeed0b2cee414296f0124fdfa061ab845ba2a158450" 2024-06-26T07:01:32.3228679Z }, 2024-06-26T07:01:32.3229008Z { 2024-06-26T07:01:32.3229565Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3230284Z "size": 186, 2024-06-26T07:01:32.3230975Z "digest": "sha256:d527cbbb87e3016fd72a18a9b468c945ad0ca27c5770b39debd6ed704db3a195" 2024-06-26T07:01:32.3231774Z }, 2024-06-26T07:01:32.3232116Z { 2024-06-26T07:01:32.3232682Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3233421Z "size": 6886, 2024-06-26T07:01:32.3234005Z "digest": "sha256:b57676e46aee1a8c82e528d78e5a13e31142524eea31c8b213d69ddcb6f1fe80" 2024-06-26T07:01:32.3234607Z }, 2024-06-26T07:01:32.3234855Z { 2024-06-26T07:01:32.3235282Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3235964Z "size": 1329001756, 2024-06-26T07:01:32.3236543Z "digest": "sha256:a8c1e85b5e14cec7af70bf304cb4d4cee6a1d25eb8215b2cf4fdc33e5af5e108" 2024-06-26T07:01:32.3237172Z }, 2024-06-26T07:01:32.3237408Z { 2024-06-26T07:01:32.3237831Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3238384Z "size": 62501, 2024-06-26T07:01:32.3238908Z "digest": "sha256:a41a8d1c11c8d80fe4e82b0d05478f8d51176ff20b8350905fc1b25c93a51198" 2024-06-26T07:01:32.3239515Z }, 2024-06-26T07:01:32.3239758Z { 2024-06-26T07:01:32.3240180Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3240725Z "size": 1684, 2024-06-26T07:01:32.3241245Z "digest": "sha256:0c12278907551c2962927d27c115f6f7bf0df894318b8aea6ece3ef01ccd0a8a" 2024-06-26T07:01:32.3241840Z }, 2024-06-26T07:01:32.3242089Z { 2024-06-26T07:01:32.3242514Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3243063Z "size": 1523, 2024-06-26T07:01:32.3243615Z "digest": "sha256:d8d1234baab3ec9ccb8bb710fc6b8ff6c10896ba2e8d27a347583eca770f9ff1" 2024-06-26T07:01:32.3244251Z }, 2024-06-26T07:01:32.3244489Z { 2024-06-26T07:01:32.3245181Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3245817Z + exit 0 2024-06-26T07:01:32.3246078Z "size": 2528295403, 2024-06-26T07:01:32.3246638Z "digest": "sha256:7ed32bc8e4696fcdb2feef850781160597b2275ad756819c4add88236b0577d5" 2024-06-26T07:01:32.3247254Z }, 2024-06-26T07:01:32.3247498Z { 2024-06-26T07:01:32.3247925Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3248479Z "size": 86016, 2024-06-26T07:01:32.3249025Z "digest": "sha256:ec1e7978c1fe161ced1d98092a51e7c5953ca5fda5577f54df9dbda4afff1b2b" 2024-06-26T07:01:32.3249659Z }, 2024-06-26T07:01:32.3249906Z { 2024-06-26T07:01:32.3250325Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3250879Z "size": 1823, 2024-06-26T07:01:32.3251417Z "digest": "sha256:53dcf3fc784e25638bd8672f84820ec80123d18c54adbec58051987420636d8c" 2024-06-26T07:01:32.3252024Z }, 2024-06-26T07:01:32.3252279Z { 2024-06-26T07:01:32.3252707Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3253247Z "size": 246736948, 2024-06-26T07:01:32.3253782Z "digest": "sha256:0935fd0635f28805705fca8c7c546f64ff07980fa9a3b7151863647367bed15d" 2024-06-26T07:01:32.3254366Z }, 2024-06-26T07:01:32.3254606Z { 2024-06-26T07:01:32.3255029Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3255582Z "size": 546, 2024-06-26T07:01:32.3256100Z "digest": "sha256:244ea295e7fb5876c904e48012708b9c1ff4ffcaa63644888ca117a4d46c9f56" 2024-06-26T07:01:32.3256788Z }, 2024-06-26T07:01:32.3257033Z { 2024-06-26T07:01:32.3257448Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3258000Z "size": 1286, 2024-06-26T07:01:32.3258529Z "digest": "sha256:8872055c179e6ffca58af79d6a50926f7726c5fffe6989160dc9cfa8c27a1d4e" 2024-06-26T07:01:32.3259138Z }, 2024-06-26T07:01:32.3259373Z { 2024-06-26T07:01:32.3259798Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3260346Z "size": 485, 2024-06-26T07:01:32.3260874Z "digest": "sha256:b8e45ccefd130e4a14275395637bfc70bfb55cf38f03f10fd8d559cce9d067e1" 2024-06-26T07:01:32.3261481Z }, 2024-06-26T07:01:32.3261723Z { 2024-06-26T07:01:32.3262140Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3262689Z "size": 91710700, 2024-06-26T07:01:32.3263239Z "digest": "sha256:1d8219281e67bd4de633ba073107e3ceed4ee7cdca2f1fdf89a97e559c26796f" 2024-06-26T07:01:32.3263860Z }, 2024-06-26T07:01:32.3264104Z { 2024-06-26T07:01:32.3264531Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3265072Z "size": 3235, 2024-06-26T07:01:32.3265741Z "digest": "sha256:55dc2e88315a42a3b460af73f33d774060a6b78993db34b5225a50b2d1505fa2" 2024-06-26T07:01:32.3266350Z }, 2024-06-26T07:01:32.3266583Z { 2024-06-26T07:01:32.3267012Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3267569Z "size": 1905, 2024-06-26T07:01:32.3268089Z "digest": "sha256:17781119b281ef55fd019ab44ad1c2d5a5814f6f026e41a5d6411c34a248fea8" 2024-06-26T07:01:32.3268690Z }, 2024-06-26T07:01:32.3268929Z { 2024-06-26T07:01:32.3269346Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3269894Z "size": 699, 2024-06-26T07:01:32.3270422Z "digest": "sha256:c649f5998fff514b98f17e1a3f8c61066c6cfda238b1e7a465e22674162e8d4d" 2024-06-26T07:01:32.3271014Z }, 2024-06-26T07:01:32.3271262Z { 2024-06-26T07:01:32.3271692Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3272237Z "size": 2784792023, 2024-06-26T07:01:32.3272794Z "digest": "sha256:face7ff0cce657a334e8693efde0114d07275134a95ed40d5e8295e44faf074c" 2024-06-26T07:01:32.3273409Z }, 2024-06-26T07:01:32.3273646Z { 2024-06-26T07:01:32.3274073Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3274622Z "size": 381, 2024-06-26T07:01:32.3275142Z "digest": "sha256:e775bf742328067c8a63a9a8adc73012489415becb116a39baad9d0efaa05ea1" 2024-06-26T07:01:32.3275741Z }, 2024-06-26T07:01:32.3275986Z { 2024-06-26T07:01:32.3276402Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3276949Z "size": 11896, 2024-06-26T07:01:32.3277490Z "digest": "sha256:7b94382da9db5dcf1ff97cb741e2c8a9674f8bd49923a70e12ff951ddffb88c4" 2024-06-26T07:01:32.3278148Z }, 2024-06-26T07:01:32.3278391Z { 2024-06-26T07:01:32.3278813Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3279351Z "size": 803, 2024-06-26T07:01:32.3279884Z "digest": "sha256:e5fa11ef6656a7059ded11a1ba28561afb88e91d86edbef45fe420cfe29797f6" 2024-06-26T07:01:32.3280501Z }, 2024-06-26T07:01:32.3280737Z { 2024-06-26T07:01:32.3281173Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3281721Z "size": 106, 2024-06-26T07:01:32.3282254Z "digest": "sha256:9835eabf09ba5bc5d1e5ffbeefda2754dd06686a7b852a374fc1852d9731471a" 2024-06-26T07:01:32.3282881Z }, 2024-06-26T07:01:32.3283123Z { 2024-06-26T07:01:32.3283541Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3284087Z "size": 502, 2024-06-26T07:01:32.3284863Z "digest": "sha256:b76dc9b5ad60c5d7c18eddfdcd70e80b173e3197f6b177f167908259e0499249" 2024-06-26T07:01:32.3285491Z }, 2024-06-26T07:01:32.3285732Z { 2024-06-26T07:01:32.3286157Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3286811Z "size": 121477176, 2024-06-26T07:01:32.3287418Z "digest": "sha256:7ecf8de715a84da044df0d0b198f3b794efcdf7d49ea2f3c7d5f2657b756e2cf" 2024-06-26T07:01:32.3288095Z }, 2024-06-26T07:01:32.3288333Z { 2024-06-26T07:01:32.3288785Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3289366Z "size": 109, 2024-06-26T07:01:32.3289924Z "digest": "sha256:23da2d4005d33260ff3d4a5ca97dfde59f70a6825018f7c5096381d76182d0ae" 2024-06-26T07:01:32.3290581Z }, 2024-06-26T07:01:32.3290826Z { 2024-06-26T07:01:32.3291263Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3291851Z "size": 489, 2024-06-26T07:01:32.3292426Z "digest": "sha256:813dde637cf57f83d09d2b63a37c4ba4575dab0c3ab89e6955e022eac325ab05" 2024-06-26T07:01:32.3293085Z }, 2024-06-26T07:01:32.3293329Z { 2024-06-26T07:01:32.3293773Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3294360Z "size": 296, 2024-06-26T07:01:32.3294937Z "digest": "sha256:ad5f73a0c0a6c7a75b33d6ef0d020c9405fc6968dfe2b12c02f59014c5713a81" 2024-06-26T07:01:32.3295599Z }, 2024-06-26T07:01:32.3295836Z { 2024-06-26T07:01:32.3296353Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3296941Z "size": 103, 2024-06-26T07:01:32.3297499Z "digest": "sha256:1096382abe26eef405d7f8d86262e292f9aaaf5582dd4b9a79420479f23756d4" 2024-06-26T07:01:32.3298154Z }, 2024-06-26T07:01:32.3298398Z { 2024-06-26T07:01:32.3298832Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3299415Z "size": 1473, 2024-06-26T07:01:32.3299994Z "digest": "sha256:dde82f18c5d712be333f10918fbadbe89dd2d36a81380e1973c2685a6dfb72e3" 2024-06-26T07:01:32.3300651Z }, 2024-06-26T07:01:32.3300901Z { 2024-06-26T07:01:32.3301348Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3301933Z "size": 446710977, 2024-06-26T07:01:32.3302527Z "digest": "sha256:e5d823764d67cacd606578bc6afa95c3ad046de89058997c1c33369000289252" 2024-06-26T07:01:32.3303179Z }, 2024-06-26T07:01:32.3303418Z { 2024-06-26T07:01:32.3303862Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3304449Z "size": 160, 2024-06-26T07:01:32.3305003Z "digest": "sha256:2c6331d88964857d22a59bfbdb25cd4444585ca4449488c8bf27830ed2b06442" 2024-06-26T07:01:32.3305709Z }, 2024-06-26T07:01:32.3305953Z { 2024-06-26T07:01:32.3306371Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3306912Z "size": 565, 2024-06-26T07:01:32.3307431Z "digest": "sha256:1faf867e96897a2e83a950665be829ad18c3997a3273b720b2973218dfabfa76" 2024-06-26T07:01:32.3308020Z }, 2024-06-26T07:01:32.3308261Z { 2024-06-26T07:01:32.3308684Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3309227Z "size": 35874512, 2024-06-26T07:01:32.3309771Z "digest": "sha256:cbd7992bcbce464363388f40a66d4a9861753bff455c6a6a74517e334d70ad62" 2024-06-26T07:01:32.3310391Z }, 2024-06-26T07:01:32.3310628Z { 2024-06-26T07:01:32.3311052Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3311598Z "size": 104, 2024-06-26T07:01:32.3312133Z "digest": "sha256:70b6526a3ba18cc94f2ea6c01a69008adfcb71dddd40da85e936bffd6d2284fe" 2024-06-26T07:01:32.3312750Z }, 2024-06-26T07:01:32.3312991Z { 2024-06-26T07:01:32.3313407Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3313950Z "size": 425, 2024-06-26T07:01:32.3314474Z "digest": "sha256:2d8c70ea270e7de194268ddcc0786cd529eb3159818484b11e524f90b5104209" 2024-06-26T07:01:32.3315115Z }, 2024-06-26T07:01:32.3315429Z { 2024-06-26T07:01:32.3315910Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3316598Z "size": 20262086, 2024-06-26T07:01:32.3317144Z "digest": "sha256:e27f286cee799880d007c67787eea4738b399a729aeeedf1584f0285b77578b4" 2024-06-26T07:01:32.3317870Z }, 2024-06-26T07:01:32.3318108Z { 2024-06-26T07:01:32.3318586Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3319139Z "size": 439, 2024-06-26T07:01:32.3319670Z "digest": "sha256:de70e49a22399ad44315dec208b9c685b6d74db46f252586847f06bd1b79b1c3" 2024-06-26T07:01:32.3320281Z }, 2024-06-26T07:01:32.3320517Z { 2024-06-26T07:01:32.3320942Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3321491Z "size": 699, 2024-06-26T07:01:32.3322008Z "digest": "sha256:c649f5998fff514b98f17e1a3f8c61066c6cfda238b1e7a465e22674162e8d4d" 2024-06-26T07:01:32.3322611Z }, 2024-06-26T07:01:32.3322862Z { 2024-06-26T07:01:32.3323278Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3323830Z "size": 143, 2024-06-26T07:01:32.3324350Z "digest": "sha256:5c0d093307564f2fcc700954ce6d904710bff6d2c828072656e3d855d59cbb08" 2024-06-26T07:01:32.3325108Z }, 2024-06-26T07:01:32.3325368Z { 2024-06-26T07:01:32.3325797Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3326342Z "size": 135, 2024-06-26T07:01:32.3326966Z "digest": "sha256:cd6bcbb42af6023d87ba22bf3d803903cd3978712c3f17e79a49041c4ee0ff81" 2024-06-26T07:01:32.3327589Z }, 2024-06-26T07:01:32.3327829Z { 2024-06-26T07:01:32.3328260Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3328823Z "size": 32, 2024-06-26T07:01:32.3329358Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3329977Z }, 2024-06-26T07:01:32.3330218Z { 2024-06-26T07:01:32.3330641Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3331188Z "size": 191, 2024-06-26T07:01:32.3331731Z "digest": "sha256:7c13ee612ccc9922ee7cd50b8633b7dc19becc187e7cf86c37ce1ac19451b0aa" 2024-06-26T07:01:32.3332347Z }, 2024-06-26T07:01:32.3332597Z { 2024-06-26T07:01:32.3333033Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3333575Z "size": 564, 2024-06-26T07:01:32.3334101Z "digest": "sha256:25b1f081a649ec50b0da772909ebc01bba75e2afc293552c8a086f7fe0e76813" 2024-06-26T07:01:32.3334711Z }, 2024-06-26T07:01:32.3334962Z { 2024-06-26T07:01:32.3335390Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3335936Z "size": 43163111, 2024-06-26T07:01:32.3336482Z "digest": "sha256:ebd46885b2bf14e156da307176e4d83282cba93f6c6110c65291e7a54db65e07" 2024-06-26T07:01:32.3337094Z }, 2024-06-26T07:01:32.3337336Z { 2024-06-26T07:01:32.3337756Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3338306Z "size": 106, 2024-06-26T07:01:32.3338840Z "digest": "sha256:380d4036734b6be26b98ebe3fef99ad8f13ed5f78d046d3d25b3aa0e2f71782f" 2024-06-26T07:01:32.3339449Z }, 2024-06-26T07:01:32.3339694Z { 2024-06-26T07:01:32.3340118Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3340686Z "size": 1212, 2024-06-26T07:01:32.3341220Z "digest": "sha256:0578257b9245cf313dbf0eea07b6ed289b91f9c20a405f26a5c0a8e5895fe0b5" 2024-06-26T07:01:32.3341822Z }, 2024-06-26T07:01:32.3342067Z { 2024-06-26T07:01:32.3342507Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3343059Z "size": 699, 2024-06-26T07:01:32.3343580Z "digest": "sha256:c649f5998fff514b98f17e1a3f8c61066c6cfda238b1e7a465e22674162e8d4d" 2024-06-26T07:01:32.3344193Z }, 2024-06-26T07:01:32.3344443Z { 2024-06-26T07:01:32.3344864Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3345417Z "size": 137, 2024-06-26T07:01:32.3345999Z "digest": "sha256:f65b3a065edbaf29202f5df2f5f38db52c162ff4915f3d0a4e2d268374bdccde" 2024-06-26T07:01:32.3346607Z }, 2024-06-26T07:01:32.3346849Z { 2024-06-26T07:01:32.3347271Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3347883Z "size": 120, 2024-06-26T07:01:32.3348410Z "digest": "sha256:cca983685e21446707cac342df20a56b1d148ff7fbf6d505103a80641b2856e9" 2024-06-26T07:01:32.3349009Z }, 2024-06-26T07:01:32.3349245Z { 2024-06-26T07:01:32.3349681Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3350233Z "size": 1905181001, 2024-06-26T07:01:32.3350778Z "digest": "sha256:349f4b6c9f1664f950682d2206c53ff3e8b553ca933ea50b9ba25a3ab16c8e58" 2024-06-26T07:01:32.3351372Z }, 2024-06-26T07:01:32.3351612Z { 2024-06-26T07:01:32.3352033Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3352578Z "size": 172, 2024-06-26T07:01:32.3353103Z "digest": "sha256:232f4ffc1ad632687cff13f27a571db5da95c8109f3e7389f83c713224015df3" 2024-06-26T07:01:32.3353700Z }, 2024-06-26T07:01:32.3353942Z { 2024-06-26T07:01:32.3354364Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3354900Z "size": 908, 2024-06-26T07:01:32.3355430Z "digest": "sha256:58ba66ea2bf31b755d42120c3e896c50c46de6166241aa0a4a062d4e675d6ffc" 2024-06-26T07:01:32.3356029Z }, 2024-06-26T07:01:32.3356266Z { 2024-06-26T07:01:32.3356694Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3357294Z "size": 699, 2024-06-26T07:01:32.3357815Z "digest": "sha256:c649f5998fff514b98f17e1a3f8c61066c6cfda238b1e7a465e22674162e8d4d" 2024-06-26T07:01:32.3358414Z }, 2024-06-26T07:01:32.3358657Z { 2024-06-26T07:01:32.3359074Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3359618Z "size": 134, 2024-06-26T07:01:32.3360129Z "digest": "sha256:8726153a09113618974da9f11ccdefe524419e865995ef989c24e2c99300e0cc" 2024-06-26T07:01:32.3360711Z }, 2024-06-26T07:01:32.3360956Z { 2024-06-26T07:01:32.3361382Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3361923Z "size": 32, 2024-06-26T07:01:32.3362452Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3363072Z }, 2024-06-26T07:01:32.3363305Z { 2024-06-26T07:01:32.3363727Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3364272Z "size": 156, 2024-06-26T07:01:32.3364915Z "digest": "sha256:ac34288e6c8e16dd823a7ffa47b41149ffd78ceeb334922b3f8e47448bd4042c" 2024-06-26T07:01:32.3365521Z }, 2024-06-26T07:01:32.3365770Z { 2024-06-26T07:01:32.3366187Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3366729Z "size": 1841, 2024-06-26T07:01:32.3367267Z "digest": "sha256:663a07bccf9ee6c16f5d24ff6c67aad73dfa2f3f323920dd71a040f37206b7b2" 2024-06-26T07:01:32.3367870Z }, 2024-06-26T07:01:32.3368115Z { 2024-06-26T07:01:32.3368539Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3369073Z "size": 7529780, 2024-06-26T07:01:32.3369585Z "digest": "sha256:7312121199e8b7b471461a896d6c6088b5b7f86700f7204a1b569046107bc656" 2024-06-26T07:01:32.3370168Z }, 2024-06-26T07:01:32.3370402Z { 2024-06-26T07:01:32.3370822Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3371363Z "size": 165, 2024-06-26T07:01:32.3371899Z "digest": "sha256:287ac2ec71fbe124f4df1f5fd9d37aa7a43b0caff7e9200e9735f76db837e310" 2024-06-26T07:01:32.3372515Z }, 2024-06-26T07:01:32.3372755Z { 2024-06-26T07:01:32.3373174Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3373716Z "size": 7944, 2024-06-26T07:01:32.3374247Z "digest": "sha256:d996b9174ec3d88306d067b17ed26fb0ffbcd478da1c7e947d575c976f00f4bd" 2024-06-26T07:01:32.3374848Z }, 2024-06-26T07:01:32.3375089Z { 2024-06-26T07:01:32.3375508Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3376050Z "size": 8072, 2024-06-26T07:01:32.3376565Z "digest": "sha256:f5e21892ce122c0b70256134c447d20036c7c9a67c04b9318df137814656c22c" 2024-06-26T07:01:32.3377236Z }, 2024-06-26T07:01:32.3377475Z { 2024-06-26T07:01:32.3377897Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3378440Z "size": 303, 2024-06-26T07:01:32.3378962Z "digest": "sha256:11bfa431de03304dd16563d15e2b1a1d5830043b83b941aa750bc9e011bc6f77" 2024-06-26T07:01:32.3379566Z }, 2024-06-26T07:01:32.3379812Z { 2024-06-26T07:01:32.3380228Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3380777Z "size": 7629008, 2024-06-26T07:01:32.3381318Z "digest": "sha256:2ff96c5eb373de25f91190d2cd3257d530ddd3711155dcff363f71085b664b15" 2024-06-26T07:01:32.3381911Z }, 2024-06-26T07:01:32.3382153Z { 2024-06-26T07:01:32.3382574Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3383106Z "size": 108, 2024-06-26T07:01:32.3383635Z "digest": "sha256:9f78da6bf632307ab28ab5b11452f6c5cdbc5f53035fbde46db6b33fe6fb3024" 2024-06-26T07:01:32.3384243Z }, 2024-06-26T07:01:32.3384485Z { 2024-06-26T07:01:32.3384906Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3385450Z "size": 54145774, 2024-06-26T07:01:32.3386049Z "digest": "sha256:299e957bfa95a77cadcd43db0fe57bfe3388da9439aa075490fb48ab57bc090d" 2024-06-26T07:01:32.3386740Z }, 2024-06-26T07:01:32.3386989Z { 2024-06-26T07:01:32.3387408Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3388003Z "size": 474, 2024-06-26T07:01:32.3388523Z "digest": "sha256:809fbdc1d051e7abdcfc00816306128493ef9717bd1f60563182a7618f0cb56c" 2024-06-26T07:01:32.3389116Z }, 2024-06-26T07:01:32.3389359Z { 2024-06-26T07:01:32.3389783Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3390329Z "size": 1374858877, 2024-06-26T07:01:32.3390877Z "digest": "sha256:57ac44bb7f0628cdc864c4f361d45881a86a3c94f651ff70f10369fa9c57bf16" 2024-06-26T07:01:32.3391477Z }, 2024-06-26T07:01:32.3391723Z { 2024-06-26T07:01:32.3392158Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3392702Z "size": 106, 2024-06-26T07:01:32.3393219Z "digest": "sha256:9efc8d031b86fa0d1235283147985e59284de23cd3293f021af2c7dd7621c329" 2024-06-26T07:01:32.3393814Z }, 2024-06-26T07:01:32.3394068Z { 2024-06-26T07:01:32.3394488Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3395030Z "size": 558, 2024-06-26T07:01:32.3395546Z "digest": "sha256:338f5515764f88e0cdb63f01b8d2837ce9d349cb355f984c89c57a80fa43e85c" 2024-06-26T07:01:32.3396151Z }, 2024-06-26T07:01:32.3396385Z { 2024-06-26T07:01:32.3396817Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3397365Z "size": 46248556, 2024-06-26T07:01:32.3397916Z "digest": "sha256:d2cda4294dfa14a67ffb37676ffdbb1a6503405aade953aa438a4d0b33f427e8" 2024-06-26T07:01:32.3398536Z }, 2024-06-26T07:01:32.3398779Z { 2024-06-26T07:01:32.3399214Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3399765Z "size": 111, 2024-06-26T07:01:32.3400295Z "digest": "sha256:66b50dcc6ac68472d333ed54d2c38b24b490e0bf0ca929186faffe4634693c56" 2024-06-26T07:01:32.3400906Z }, 2024-06-26T07:01:32.3401149Z { 2024-06-26T07:01:32.3401586Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3402122Z "size": 32, 2024-06-26T07:01:32.3402657Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3403263Z }, 2024-06-26T07:01:32.3403497Z { 2024-06-26T07:01:32.3403919Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3417270Z "size": 32, 2024-06-26T07:01:32.3417830Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3418462Z }, 2024-06-26T07:01:32.3418700Z { 2024-06-26T07:01:32.3419143Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3419820Z "size": 32, 2024-06-26T07:01:32.3420354Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3420977Z }, 2024-06-26T07:01:32.3421219Z { 2024-06-26T07:01:32.3421647Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2024-06-26T07:01:32.3422202Z "size": 32, 2024-06-26T07:01:32.3422730Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2024-06-26T07:01:32.3423338Z } 2024-06-26T07:01:32.3423581Z ] 2024-06-26T07:01:32.3423821Z } 2024-06-26T07:01:32.3527216Z ##[group]Run tag=${ECR_DOCKER_IMAGE##*/} 2024-06-26T07:01:32.3527715Z tag=${ECR_DOCKER_IMAGE##*/} 2024-06-26T07:01:32.3528267Z echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}" 2024-06-26T07:01:32.3536128Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:32.3536635Z env: 2024-06-26T07:01:32.3536908Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:32.3537971Z ECR_DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.3539038Z ##[endgroup] 2024-06-26T07:01:32.3562234Z docker pull ghcr.io/pytorch/ci-image:pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.3612664Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2024-06-26T07:01:32.3613240Z with: 2024-06-26T07:01:32.3614173Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.3615351Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.3615852Z env: 2024-06-26T07:01:32.3616129Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:32.3616474Z ##[endgroup] 2024-06-26T07:01:32.3633863Z ##[group]Run set -x 2024-06-26T07:01:32.3634205Z set -x 2024-06-26T07:01:32.3634506Z set +e 2024-06-26T07:01:32.3634806Z  2024-06-26T07:01:32.3635076Z login() { 2024-06-26T07:01:32.3635728Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2024-06-26T07:01:32.3636451Z } 2024-06-26T07:01:32.3636720Z  2024-06-26T07:01:32.3637021Z retry () { 2024-06-26T07:01:32.3637400Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 2024-06-26T07:01:32.3637845Z } 2024-06-26T07:01:32.3638122Z  2024-06-26T07:01:32.3638427Z retry login "${DOCKER_REGISTRY}" 2024-06-26T07:01:32.3638838Z  2024-06-26T07:01:32.3639108Z set -e 2024-06-26T07:01:32.3639577Z # ignore output since only exit code is used for conditional 2024-06-26T07:01:32.3640285Z # only pull docker image if it's not available locally 2024-06-26T07:01:32.3641058Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2024-06-26T07:01:32.3641767Z  retry docker pull "${DOCKER_IMAGE}" 2024-06-26T07:01:32.3642197Z fi 2024-06-26T07:01:32.3649231Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:01:32.3649727Z env: 2024-06-26T07:01:32.3650001Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:01:32.3651008Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.3652185Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.3652691Z ##[endgroup] 2024-06-26T07:01:32.3671191Z + set +e 2024-06-26T07:01:32.3672118Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.3673020Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.3674050Z + aws ecr get-login-password --region us-east-1 2024-06-26T07:01:32.3675345Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2024-06-26T07:01:32.7768611Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2024-06-26T07:01:32.7769807Z Configure a credential helper to remove this warning. See 2024-06-26T07:01:32.7770774Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2024-06-26T07:01:32.7771310Z 2024-06-26T07:01:32.7771443Z Login Succeeded 2024-06-26T07:01:32.7780125Z + set -e 2024-06-26T07:01:32.7781412Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.7893631Z + retry docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:32.7895500Z + docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:01:33.0292443Z 91382da70d5719cd7007b6b80b71d2f48398f6b7: Pulling from pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9 2024-06-26T07:01:33.0298251Z 7a2c55901189: Pulling fs layer 2024-06-26T07:01:33.0298916Z 224fe954d725: Pulling fs layer 2024-06-26T07:01:33.0299543Z 75722010b82e: Pulling fs layer 2024-06-26T07:01:33.0300182Z d527cbbb87e3: Pulling fs layer 2024-06-26T07:01:33.0301252Z b57676e46aee: Pulling fs layer 2024-06-26T07:01:33.0303625Z a8c1e85b5e14: Pulling fs layer 2024-06-26T07:01:33.0304310Z a41a8d1c11c8: Pulling fs layer 2024-06-26T07:01:33.0304916Z 0c1227890755: Pulling fs layer 2024-06-26T07:01:33.0306970Z d8d1234baab3: Pulling fs layer 2024-06-26T07:01:33.0308056Z 7ed32bc8e469: Pulling fs layer 2024-06-26T07:01:33.0308721Z ec1e7978c1fe: Pulling fs layer 2024-06-26T07:01:33.0309395Z 53dcf3fc784e: Pulling fs layer 2024-06-26T07:01:33.0309987Z 0935fd0635f2: Pulling fs layer 2024-06-26T07:01:33.0310513Z 244ea295e7fb: Pulling fs layer 2024-06-26T07:01:33.0310968Z 8872055c179e: Pulling fs layer 2024-06-26T07:01:33.0311517Z b8e45ccefd13: Pulling fs layer 2024-06-26T07:01:33.0311984Z 1d8219281e67: Pulling fs layer 2024-06-26T07:01:33.0312435Z 55dc2e88315a: Pulling fs layer 2024-06-26T07:01:33.0312945Z b57676e46aee: Waiting 2024-06-26T07:01:33.0313523Z 17781119b281: Pulling fs layer 2024-06-26T07:01:33.0313978Z c649f5998fff: Pulling fs layer 2024-06-26T07:01:33.0314415Z a8c1e85b5e14: Waiting 2024-06-26T07:01:33.0314831Z face7ff0cce6: Pulling fs layer 2024-06-26T07:01:33.0315282Z e775bf742328: Pulling fs layer 2024-06-26T07:01:33.0315707Z a41a8d1c11c8: Waiting 2024-06-26T07:01:33.0316212Z 7b94382da9db: Pulling fs layer 2024-06-26T07:01:33.0316789Z 0c1227890755: Waiting 2024-06-26T07:01:33.0317273Z e5fa11ef6656: Pulling fs layer 2024-06-26T07:01:33.0317901Z d8d1234baab3: Waiting 2024-06-26T07:01:33.0318362Z 9835eabf09ba: Pulling fs layer 2024-06-26T07:01:33.0318851Z b76dc9b5ad60: Pulling fs layer 2024-06-26T07:01:33.0319310Z 7ecf8de715a8: Pulling fs layer 2024-06-26T07:01:33.0319764Z 23da2d4005d3: Pulling fs layer 2024-06-26T07:01:33.0320248Z 813dde637cf5: Pulling fs layer 2024-06-26T07:01:33.0320689Z b8e45ccefd13: Waiting 2024-06-26T07:01:33.0321104Z ad5f73a0c0a6: Pulling fs layer 2024-06-26T07:01:33.0321626Z 1d8219281e67: Waiting 2024-06-26T07:01:33.0322020Z 244ea295e7fb: Waiting 2024-06-26T07:01:33.0322427Z 1096382abe26: Pulling fs layer 2024-06-26T07:01:33.0322930Z 8872055c179e: Waiting 2024-06-26T07:01:33.0323339Z dde82f18c5d7: Pulling fs layer 2024-06-26T07:01:33.0323792Z e5d823764d67: Pulling fs layer 2024-06-26T07:01:33.0324301Z 2c6331d88964: Pulling fs layer 2024-06-26T07:01:33.0324930Z 1faf867e9689: Pulling fs layer 2024-06-26T07:01:33.0325384Z cbd7992bcbce: Pulling fs layer 2024-06-26T07:01:33.0325936Z 7ed32bc8e469: Waiting 2024-06-26T07:01:33.0326345Z 70b6526a3ba1: Pulling fs layer 2024-06-26T07:01:33.0326909Z 2d8c70ea270e: Pulling fs layer 2024-06-26T07:01:33.0327352Z e27f286cee79: Pulling fs layer 2024-06-26T07:01:33.0327784Z 55dc2e88315a: Waiting 2024-06-26T07:01:33.0328188Z de70e49a2239: Pulling fs layer 2024-06-26T07:01:33.0328615Z ec1e7978c1fe: Waiting 2024-06-26T07:01:33.0329159Z 17781119b281: Waiting 2024-06-26T07:01:33.0329543Z 23da2d4005d3: Waiting 2024-06-26T07:01:33.0329940Z 5c0d09330756: Pulling fs layer 2024-06-26T07:01:33.0330372Z 53dcf3fc784e: Waiting 2024-06-26T07:01:33.0330761Z 813dde637cf5: Waiting 2024-06-26T07:01:33.0331138Z 0935fd0635f2: Waiting 2024-06-26T07:01:33.0331531Z 7ecf8de715a8: Waiting 2024-06-26T07:01:33.0331945Z cd6bcbb42af6: Pulling fs layer 2024-06-26T07:01:33.0332378Z b76dc9b5ad60: Waiting 2024-06-26T07:01:33.0332872Z 4f4fb700ef54: Pulling fs layer 2024-06-26T07:01:33.0333440Z e775bf742328: Waiting 2024-06-26T07:01:33.0333840Z 7c13ee612ccc: Pulling fs layer 2024-06-26T07:01:33.0334295Z 25b1f081a649: Pulling fs layer 2024-06-26T07:01:33.0334793Z e5fa11ef6656: Waiting 2024-06-26T07:01:33.0335195Z ebd46885b2bf: Pulling fs layer 2024-06-26T07:01:33.0335632Z d527cbbb87e3: Waiting 2024-06-26T07:01:33.0336033Z 380d4036734b: Pulling fs layer 2024-06-26T07:01:33.0336469Z 0578257b9245: Pulling fs layer 2024-06-26T07:01:33.0336924Z f65b3a065edb: Pulling fs layer 2024-06-26T07:01:33.0337371Z cca983685e21: Pulling fs layer 2024-06-26T07:01:33.0337810Z 349f4b6c9f16: Pulling fs layer 2024-06-26T07:01:33.0338244Z 1096382abe26: Waiting 2024-06-26T07:01:33.0338631Z ebd46885b2bf: Waiting 2024-06-26T07:01:33.0339016Z dde82f18c5d7: Waiting 2024-06-26T07:01:33.0339542Z 232f4ffc1ad6: Pulling fs layer 2024-06-26T07:01:33.0339978Z e5d823764d67: Waiting 2024-06-26T07:01:33.0340374Z 58ba66ea2bf3: Pulling fs layer 2024-06-26T07:01:33.0340809Z 7c13ee612ccc: Waiting 2024-06-26T07:01:33.0341195Z 25b1f081a649: Waiting 2024-06-26T07:01:33.0341575Z 380d4036734b: Waiting 2024-06-26T07:01:33.0341957Z 0578257b9245: Waiting 2024-06-26T07:01:33.0342341Z 5c0d09330756: Waiting 2024-06-26T07:01:33.0342732Z 8726153a0911: Pulling fs layer 2024-06-26T07:01:33.0343181Z ac34288e6c8e: Pulling fs layer 2024-06-26T07:01:33.0343614Z e27f286cee79: Waiting 2024-06-26T07:01:33.0343992Z 4f4fb700ef54: Waiting 2024-06-26T07:01:33.0344378Z de70e49a2239: Waiting 2024-06-26T07:01:33.0344787Z 663a07bccf9e: Pulling fs layer 2024-06-26T07:01:33.0345216Z cca983685e21: Waiting 2024-06-26T07:01:33.0345717Z c649f5998fff: Waiting 2024-06-26T07:01:33.0346109Z 232f4ffc1ad6: Waiting 2024-06-26T07:01:33.0346489Z 349f4b6c9f16: Waiting 2024-06-26T07:01:33.0346966Z 7312121199e8: Pulling fs layer 2024-06-26T07:01:33.0347463Z 2c6331d88964: Waiting 2024-06-26T07:01:33.0347859Z 287ac2ec71fb: Pulling fs layer 2024-06-26T07:01:33.0348309Z d996b9174ec3: Pulling fs layer 2024-06-26T07:01:33.0348756Z f5e21892ce12: Pulling fs layer 2024-06-26T07:01:33.0349181Z 2d8c70ea270e: Waiting 2024-06-26T07:01:33.0349575Z cd6bcbb42af6: Waiting 2024-06-26T07:01:33.0349976Z 11bfa431de03: Pulling fs layer 2024-06-26T07:01:33.0350418Z cbd7992bcbce: Waiting 2024-06-26T07:01:33.0350832Z 2ff96c5eb373: Pulling fs layer 2024-06-26T07:01:33.0351255Z 8726153a0911: Waiting 2024-06-26T07:01:33.0351640Z 70b6526a3ba1: Waiting 2024-06-26T07:01:33.0352044Z 9f78da6bf632: Pulling fs layer 2024-06-26T07:01:33.0352469Z 7312121199e8: Waiting 2024-06-26T07:01:33.0352871Z 299e957bfa95: Pulling fs layer 2024-06-26T07:01:33.0353306Z 11bfa431de03: Waiting 2024-06-26T07:01:33.0353700Z 809fbdc1d051: Pulling fs layer 2024-06-26T07:01:33.0354146Z 57ac44bb7f06: Pulling fs layer 2024-06-26T07:01:33.0354580Z 2ff96c5eb373: Waiting 2024-06-26T07:01:33.0354983Z 9efc8d031b86: Pulling fs layer 2024-06-26T07:01:33.0355418Z 1faf867e9689: Waiting 2024-06-26T07:01:33.0355816Z 809fbdc1d051: Waiting 2024-06-26T07:01:33.0356195Z 299e957bfa95: Waiting 2024-06-26T07:01:33.0356594Z 338f5515764f: Pulling fs layer 2024-06-26T07:01:33.0357026Z 57ac44bb7f06: Waiting 2024-06-26T07:01:33.0357422Z d2cda4294dfa: Pulling fs layer 2024-06-26T07:01:33.0357876Z 66b50dcc6ac6: Pulling fs layer 2024-06-26T07:01:33.0358318Z f65b3a065edb: Waiting 2024-06-26T07:01:33.0358698Z 338f5515764f: Waiting 2024-06-26T07:01:33.0359084Z d2cda4294dfa: Waiting 2024-06-26T07:01:33.0359473Z 287ac2ec71fb: Waiting 2024-06-26T07:01:33.0359852Z d996b9174ec3: Waiting 2024-06-26T07:01:33.0360244Z 9f78da6bf632: Waiting 2024-06-26T07:01:33.0360706Z face7ff0cce6: Waiting 2024-06-26T07:01:33.0361088Z f5e21892ce12: Waiting 2024-06-26T07:01:33.1682315Z 224fe954d725: Verifying Checksum 2024-06-26T07:01:33.1683181Z 224fe954d725: Download complete 2024-06-26T07:01:33.2483742Z d527cbbb87e3: Download complete 2024-06-26T07:01:33.3291367Z b57676e46aee: Verifying Checksum 2024-06-26T07:01:33.3292011Z b57676e46aee: Download complete 2024-06-26T07:01:33.3837196Z 7a2c55901189: Verifying Checksum 2024-06-26T07:01:33.3837989Z 7a2c55901189: Download complete 2024-06-26T07:01:33.4494825Z a41a8d1c11c8: Verifying Checksum 2024-06-26T07:01:33.4495414Z a41a8d1c11c8: Download complete 2024-06-26T07:01:33.5251079Z 0c1227890755: Verifying Checksum 2024-06-26T07:01:33.5251796Z 0c1227890755: Download complete 2024-06-26T07:01:33.6043815Z d8d1234baab3: Download complete 2024-06-26T07:01:33.6527931Z 75722010b82e: Verifying Checksum 2024-06-26T07:01:33.6528607Z 75722010b82e: Download complete 2024-06-26T07:01:33.7361497Z ec1e7978c1fe: Download complete 2024-06-26T07:01:33.8192415Z 53dcf3fc784e: Download complete 2024-06-26T07:01:34.4109436Z 7a2c55901189: Pull complete 2024-06-26T07:01:34.6925119Z 224fe954d725: Pull complete 2024-06-26T07:01:35.6205947Z 75722010b82e: Pull complete 2024-06-26T07:01:35.6616814Z d527cbbb87e3: Pull complete 2024-06-26T07:01:35.7005676Z b57676e46aee: Pull complete 2024-06-26T07:01:36.3430489Z 0935fd0635f2: Verifying Checksum 2024-06-26T07:01:36.3430955Z 0935fd0635f2: Download complete 2024-06-26T07:01:36.4279726Z 244ea295e7fb: Download complete 2024-06-26T07:01:36.5095797Z 8872055c179e: Verifying Checksum 2024-06-26T07:01:36.5096250Z 8872055c179e: Download complete 2024-06-26T07:01:36.5917522Z b8e45ccefd13: Verifying Checksum 2024-06-26T07:01:36.5918260Z b8e45ccefd13: Download complete 2024-06-26T07:01:37.5611479Z 1d8219281e67: Download complete 2024-06-26T07:01:37.6330991Z 55dc2e88315a: Download complete 2024-06-26T07:01:37.7085256Z 17781119b281: Verifying Checksum 2024-06-26T07:01:37.7085841Z 17781119b281: Download complete 2024-06-26T07:01:37.7986578Z c649f5998fff: Verifying Checksum 2024-06-26T07:01:37.7987118Z c649f5998fff: Download complete 2024-06-26T07:01:46.6699683Z a8c1e85b5e14: Verifying Checksum 2024-06-26T07:01:46.6700316Z a8c1e85b5e14: Download complete 2024-06-26T07:01:46.7492421Z e775bf742328: Verifying Checksum 2024-06-26T07:01:46.7493049Z e775bf742328: Download complete 2024-06-26T07:01:46.8164237Z 7b94382da9db: Download complete 2024-06-26T07:01:46.8853281Z e5fa11ef6656: Verifying Checksum 2024-06-26T07:01:46.8853733Z e5fa11ef6656: Download complete 2024-06-26T07:01:46.9617291Z 9835eabf09ba: Verifying Checksum 2024-06-26T07:01:46.9618037Z 9835eabf09ba: Download complete 2024-06-26T07:01:47.0324299Z b76dc9b5ad60: Verifying Checksum 2024-06-26T07:01:47.0324958Z b76dc9b5ad60: Download complete 2024-06-26T07:01:48.2917119Z 7ecf8de715a8: Verifying Checksum 2024-06-26T07:01:48.2917600Z 7ecf8de715a8: Download complete 2024-06-26T07:01:48.3613503Z 23da2d4005d3: Verifying Checksum 2024-06-26T07:01:48.3614190Z 23da2d4005d3: Download complete 2024-06-26T07:01:48.4274964Z 813dde637cf5: Verifying Checksum 2024-06-26T07:01:48.4275463Z 813dde637cf5: Download complete 2024-06-26T07:01:48.4904190Z ad5f73a0c0a6: Download complete 2024-06-26T07:01:48.5531652Z 1096382abe26: Verifying Checksum 2024-06-26T07:01:48.5532271Z 1096382abe26: Download complete 2024-06-26T07:01:48.6222244Z dde82f18c5d7: Verifying Checksum 2024-06-26T07:01:48.6222795Z dde82f18c5d7: Download complete 2024-06-26T07:01:53.1489622Z e5d823764d67: Verifying Checksum 2024-06-26T07:01:53.1490268Z e5d823764d67: Download complete 2024-06-26T07:01:53.2417143Z 2c6331d88964: Verifying Checksum 2024-06-26T07:01:53.2417720Z 2c6331d88964: Download complete 2024-06-26T07:01:53.3125608Z 1faf867e9689: Verifying Checksum 2024-06-26T07:01:53.3126163Z 1faf867e9689: Download complete 2024-06-26T07:01:53.7226283Z cbd7992bcbce: Verifying Checksum 2024-06-26T07:01:53.7226832Z cbd7992bcbce: Download complete 2024-06-26T07:01:53.8087022Z 70b6526a3ba1: Verifying Checksum 2024-06-26T07:01:53.8087811Z 70b6526a3ba1: Download complete 2024-06-26T07:01:53.8873850Z 2d8c70ea270e: Download complete 2024-06-26T07:01:54.1562891Z e27f286cee79: Verifying Checksum 2024-06-26T07:01:54.1563517Z e27f286cee79: Download complete 2024-06-26T07:01:54.2248292Z de70e49a2239: Verifying Checksum 2024-06-26T07:01:54.2248885Z de70e49a2239: Download complete 2024-06-26T07:01:54.3309764Z 5c0d09330756: Verifying Checksum 2024-06-26T07:01:54.3310294Z 5c0d09330756: Download complete 2024-06-26T07:01:54.3966171Z cd6bcbb42af6: Download complete 2024-06-26T07:01:54.4068769Z 4f4fb700ef54: Verifying Checksum 2024-06-26T07:01:54.4069317Z 4f4fb700ef54: Download complete 2024-06-26T07:01:54.4766590Z 7c13ee612ccc: Verifying Checksum 2024-06-26T07:01:54.4767166Z 7c13ee612ccc: Download complete 2024-06-26T07:01:54.5420075Z 25b1f081a649: Verifying Checksum 2024-06-26T07:01:54.5420698Z 25b1f081a649: Download complete 2024-06-26T07:01:55.0184935Z ebd46885b2bf: Verifying Checksum 2024-06-26T07:01:55.0185566Z ebd46885b2bf: Download complete 2024-06-26T07:01:55.1018894Z 380d4036734b: Verifying Checksum 2024-06-26T07:01:55.1019559Z 380d4036734b: Download complete 2024-06-26T07:01:55.1809282Z 0578257b9245: Download complete 2024-06-26T07:01:55.2565575Z f65b3a065edb: Verifying Checksum 2024-06-26T07:01:55.2566246Z f65b3a065edb: Download complete 2024-06-26T07:01:55.3414373Z cca983685e21: Verifying Checksum 2024-06-26T07:01:55.3415035Z cca983685e21: Download complete 2024-06-26T07:01:58.9407772Z 7ed32bc8e469: Verifying Checksum 2024-06-26T07:01:58.9408249Z 7ed32bc8e469: Download complete 2024-06-26T07:01:59.0100673Z 232f4ffc1ad6: Verifying Checksum 2024-06-26T07:01:59.0101248Z 232f4ffc1ad6: Download complete 2024-06-26T07:01:59.0771952Z 58ba66ea2bf3: Verifying Checksum 2024-06-26T07:01:59.0772378Z 58ba66ea2bf3: Download complete 2024-06-26T07:01:59.1395279Z 8726153a0911: Download complete 2024-06-26T07:01:59.2647939Z a8c1e85b5e14: Pull complete 2024-06-26T07:01:59.3144322Z 663a07bccf9e: Verifying Checksum 2024-06-26T07:01:59.3144744Z 663a07bccf9e: Download complete 2024-06-26T07:01:59.4423097Z 7312121199e8: Verifying Checksum 2024-06-26T07:01:59.4423527Z 7312121199e8: Download complete 2024-06-26T07:01:59.4710498Z a41a8d1c11c8: Pull complete 2024-06-26T07:01:59.6510824Z 287ac2ec71fb: Download complete 2024-06-26T07:01:59.7015582Z 0c1227890755: Pull complete 2024-06-26T07:01:59.7153293Z d996b9174ec3: Verifying Checksum 2024-06-26T07:01:59.7153710Z d996b9174ec3: Download complete 2024-06-26T07:01:59.7844248Z f5e21892ce12: Download complete 2024-06-26T07:01:59.8498265Z 11bfa431de03: Verifying Checksum 2024-06-26T07:01:59.8498689Z 11bfa431de03: Download complete 2024-06-26T07:01:59.9305458Z d8d1234baab3: Pull complete 2024-06-26T07:01:59.9681821Z 2ff96c5eb373: Verifying Checksum 2024-06-26T07:01:59.9682407Z 2ff96c5eb373: Download complete 2024-06-26T07:02:00.0339458Z 9f78da6bf632: Verifying Checksum 2024-06-26T07:02:00.0339898Z 9f78da6bf632: Download complete 2024-06-26T07:02:00.6301306Z 299e957bfa95: Verifying Checksum 2024-06-26T07:02:00.6301922Z 299e957bfa95: Download complete 2024-06-26T07:02:00.7165979Z 809fbdc1d051: Verifying Checksum 2024-06-26T07:02:00.7166573Z 809fbdc1d051: Download complete 2024-06-26T07:02:05.7004007Z face7ff0cce6: Verifying Checksum 2024-06-26T07:02:05.7004961Z face7ff0cce6: Download complete 2024-06-26T07:02:05.7781293Z 9efc8d031b86: Verifying Checksum 2024-06-26T07:02:05.7781852Z 9efc8d031b86: Download complete 2024-06-26T07:02:05.8550317Z 338f5515764f: Verifying Checksum 2024-06-26T07:02:05.8550865Z 338f5515764f: Download complete 2024-06-26T07:02:06.3725925Z d2cda4294dfa: Verifying Checksum 2024-06-26T07:02:06.3726543Z d2cda4294dfa: Download complete 2024-06-26T07:02:06.4523660Z 66b50dcc6ac6: Verifying Checksum 2024-06-26T07:02:06.4524177Z 66b50dcc6ac6: Download complete 2024-06-26T07:02:14.4407705Z 349f4b6c9f16: Verifying Checksum 2024-06-26T07:02:14.4408277Z 349f4b6c9f16: Download complete 2024-06-26T07:02:14.5129916Z 57ac44bb7f06: Verifying Checksum 2024-06-26T07:02:14.5130379Z 57ac44bb7f06: Download complete 2024-06-26T07:02:35.6468756Z 7ed32bc8e469: Pull complete 2024-06-26T07:02:35.9322381Z ec1e7978c1fe: Pull complete 2024-06-26T07:02:36.0567932Z 53dcf3fc784e: Pull complete 2024-06-26T07:02:43.3710025Z 0935fd0635f2: Pull complete 2024-06-26T07:02:43.5654113Z 244ea295e7fb: Pull complete 2024-06-26T07:02:43.7818257Z 8872055c179e: Pull complete 2024-06-26T07:02:43.9801515Z b8e45ccefd13: Pull complete 2024-06-26T07:02:46.3488358Z 1d8219281e67: Pull complete 2024-06-26T07:02:46.5504512Z 55dc2e88315a: Pull complete 2024-06-26T07:02:46.7494568Z 17781119b281: Pull complete 2024-06-26T07:02:46.9485230Z c649f5998fff: Pull complete 2024-06-26T07:03:36.2317452Z face7ff0cce6: Pull complete 2024-06-26T07:03:36.2666556Z e775bf742328: Pull complete 2024-06-26T07:03:36.3296424Z 7b94382da9db: Pull complete 2024-06-26T07:03:36.3696017Z e5fa11ef6656: Pull complete 2024-06-26T07:03:36.4144021Z 9835eabf09ba: Pull complete 2024-06-26T07:03:36.4492688Z b76dc9b5ad60: Pull complete 2024-06-26T07:03:39.9436347Z 7ecf8de715a8: Pull complete 2024-06-26T07:03:40.1476354Z 23da2d4005d3: Pull complete 2024-06-26T07:03:40.3500565Z 813dde637cf5: Pull complete 2024-06-26T07:03:40.5648410Z ad5f73a0c0a6: Pull complete 2024-06-26T07:03:40.7674355Z 1096382abe26: Pull complete 2024-06-26T07:03:40.9817524Z dde82f18c5d7: Pull complete 2024-06-26T07:03:49.2105622Z e5d823764d67: Pull complete 2024-06-26T07:03:49.4044106Z 2c6331d88964: Pull complete 2024-06-26T07:03:49.5909877Z 1faf867e9689: Pull complete 2024-06-26T07:03:50.5847771Z cbd7992bcbce: Pull complete 2024-06-26T07:03:50.7952326Z 70b6526a3ba1: Pull complete 2024-06-26T07:03:50.9741570Z 2d8c70ea270e: Pull complete 2024-06-26T07:03:51.4608895Z e27f286cee79: Pull complete 2024-06-26T07:03:51.5823777Z de70e49a2239: Pull complete 2024-06-26T07:03:51.8777034Z 5c0d09330756: Pull complete 2024-06-26T07:03:52.0333876Z cd6bcbb42af6: Pull complete 2024-06-26T07:03:52.2356580Z 4f4fb700ef54: Pull complete 2024-06-26T07:03:52.4603092Z 7c13ee612ccc: Pull complete 2024-06-26T07:03:52.6622838Z 25b1f081a649: Pull complete 2024-06-26T07:03:54.8376967Z ebd46885b2bf: Pull complete 2024-06-26T07:03:55.0367384Z 380d4036734b: Pull complete 2024-06-26T07:03:55.2373893Z 0578257b9245: Pull complete 2024-06-26T07:03:55.6686615Z f65b3a065edb: Pull complete 2024-06-26T07:03:55.8711045Z cca983685e21: Pull complete 2024-06-26T07:04:29.8028713Z 349f4b6c9f16: Pull complete 2024-06-26T07:04:29.9560564Z 232f4ffc1ad6: Pull complete 2024-06-26T07:04:30.1697834Z 58ba66ea2bf3: Pull complete 2024-06-26T07:04:30.6028954Z 8726153a0911: Pull complete 2024-06-26T07:04:30.9091621Z ac34288e6c8e: Pull complete 2024-06-26T07:04:31.0656653Z 663a07bccf9e: Pull complete 2024-06-26T07:04:31.4112871Z 7312121199e8: Pull complete 2024-06-26T07:04:31.6119640Z 287ac2ec71fb: Pull complete 2024-06-26T07:04:31.8281425Z d996b9174ec3: Pull complete 2024-06-26T07:04:32.0386937Z f5e21892ce12: Pull complete 2024-06-26T07:04:32.2413161Z 11bfa431de03: Pull complete 2024-06-26T07:04:33.2916433Z 2ff96c5eb373: Pull complete 2024-06-26T07:04:33.5249755Z 9f78da6bf632: Pull complete 2024-06-26T07:04:35.4884379Z 299e957bfa95: Pull complete 2024-06-26T07:04:35.5351272Z 809fbdc1d051: Pull complete 2024-06-26T07:04:48.7969641Z 57ac44bb7f06: Pull complete 2024-06-26T07:04:49.0196287Z 9efc8d031b86: Pull complete 2024-06-26T07:04:49.2208472Z 338f5515764f: Pull complete 2024-06-26T07:04:50.1045536Z d2cda4294dfa: Pull complete 2024-06-26T07:04:50.2946199Z 66b50dcc6ac6: Pull complete 2024-06-26T07:04:51.3545227Z Digest: sha256:039956dde43d3becfe71079682edc6c7db7202c8395dd9e2a64a89dc944300d0 2024-06-26T07:04:51.4007185Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:04:51.4232133Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:04:51.4275174Z ##[group]Run echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT" 2024-06-26T07:04:51.4276221Z echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT" 2024-06-26T07:04:51.4284219Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:04:51.4284969Z env: 2024-06-26T07:04:51.4285239Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:04:51.4285578Z ##[endgroup] 2024-06-26T07:04:51.4443140Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2024-06-26T07:04:51.4443689Z with: 2024-06-26T07:04:51.4443976Z driver-version: 550.54.15 2024-06-26T07:04:51.4444323Z env: 2024-06-26T07:04:51.4444809Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:04:51.4445219Z ##[endgroup] 2024-06-26T07:04:51.4580665Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-06-26T07:04:51.4581226Z with: 2024-06-26T07:04:51.4581492Z timeout_minutes: 10 2024-06-26T07:04:51.4581818Z max_attempts: 3 2024-06-26T07:04:51.4614436Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y nvidia-docker2 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-docker2 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. Also check only the first GPU if there are more than one of them # so that the same driver version is not print over multiple lines INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing" elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing" else HAS_NVIDIA_DRIVER=1 echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation" fi set -e fi if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then # CAUTION: this may need to be updated in future if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then sudo yum groupinstall -y "Development Tools" # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" sudo modprobe backlight fi sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" set +e sudo /bin/bash /tmp/nvidia_driver -s --no-drm NVIDIA_INSTALLATION_STATUS=$? RESET_GPU=0 if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then sudo cat /var/log/nvidia-installer.log # Fail to install NVIDIA driver, try to reset the GPU RESET_GPU=1 elif [ -x "$(command -v nvidia-smi)" ]; then # Check again if nvidia-smi works even if the driver installation completes successfully INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! | # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" 2024-06-26T07:04:51.4647090Z retry_wait_seconds: 10 2024-06-26T07:04:51.4647460Z polling_interval_seconds: 1 2024-06-26T07:04:51.4647834Z warning_on_retry: true 2024-06-26T07:04:51.4648184Z continue_on_error: false 2024-06-26T07:04:51.4648522Z env: 2024-06-26T07:04:51.4648789Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:04:51.4649151Z DRIVER_VERSION: 550.54.15 2024-06-26T07:04:51.4649498Z ##[endgroup] 2024-06-26T07:04:51.5176522Z == Installing nvidia driver NVIDIA-Linux-x86_64-550.54.15.run == 2024-06-26T07:04:51.5177536Z + pre_install_nvidia_driver_amzn2 2024-06-26T07:04:51.5178154Z + sudo yum remove -y nvidia-driver-latest-dkms 2024-06-26T07:04:51.8214407Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-06-26T07:04:51.8600268Z No Match for argument: nvidia-driver-latest-dkms 2024-06-26T07:04:51.8838637Z No Packages marked for removal 2024-06-26T07:04:51.8959558Z + install_nvidia_driver_common 2024-06-26T07:04:51.8962344Z + echo 'Before installing NVIDIA driver' 2024-06-26T07:04:51.8964279Z + lspci 2024-06-26T07:04:51.8964908Z Before installing NVIDIA driver 2024-06-26T07:04:51.9043425Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2024-06-26T07:04:51.9044220Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2024-06-26T07:04:51.9045484Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2024-06-26T07:04:51.9046308Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2024-06-26T07:04:51.9047243Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061 2024-06-26T07:04:51.9048457Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2024-06-26T07:04:51.9049225Z 00:1e.0 3D controller: NVIDIA Corporation Device 2237 (rev a1) 2024-06-26T07:04:51.9050097Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2024-06-26T07:04:51.9050702Z + lsmod 2024-06-26T07:04:51.9057037Z Module Size Used by 2024-06-26T07:04:51.9057941Z ib_core 266240 0 2024-06-26T07:04:51.9058528Z nvidia_modeset 1339392 0 2024-06-26T07:04:51.9058939Z veth 16384 0 2024-06-26T07:04:51.9059308Z nvidia_uvm 4599808 0 2024-06-26T07:04:51.9059759Z nvidia 53989376 2 nvidia_uvm,nvidia_modeset 2024-06-26T07:04:51.9060387Z drm 421888 1 nvidia 2024-06-26T07:04:51.9060824Z i2c_core 77824 2 nvidia,drm 2024-06-26T07:04:51.9061287Z backlight 16384 1 nvidia_modeset 2024-06-26T07:04:51.9061734Z xt_conntrack 16384 1 2024-06-26T07:04:51.9062119Z ipt_MASQUERADE 16384 1 2024-06-26T07:04:51.9062552Z nf_nat_masquerade_ipv4 16384 1 ipt_MASQUERADE 2024-06-26T07:04:51.9063027Z nf_conntrack_netlink 49152 0 2024-06-26T07:04:51.9063485Z nfnetlink 16384 2 nf_conntrack_netlink 2024-06-26T07:04:51.9063948Z xfrm_user 45056 1 2024-06-26T07:04:51.9064344Z xfrm_algo 16384 1 xfrm_user 2024-06-26T07:04:51.9064842Z iptable_nat 16384 1 2024-06-26T07:04:51.9065580Z nf_conntrack_ipv4 16384 3 2024-06-26T07:04:51.9066236Z nf_defrag_ipv4 16384 1 nf_conntrack_ipv4 2024-06-26T07:04:51.9066929Z nf_nat_ipv4 16384 1 iptable_nat 2024-06-26T07:04:51.9067609Z nf_nat 32768 2 nf_nat_masquerade_ipv4,nf_nat_ipv4 2024-06-26T07:04:51.9068786Z nf_conntrack 155648 7 xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4,nf_nat,ipt_MASQUERADE,nf_nat_ipv4,nf_conntrack_netlink 2024-06-26T07:04:51.9069943Z xt_addrtype 16384 2 2024-06-26T07:04:51.9070357Z iptable_filter 16384 1 2024-06-26T07:04:51.9070833Z br_netfilter 24576 0 2024-06-26T07:04:51.9071449Z bridge 172032 1 br_netfilter 2024-06-26T07:04:51.9072102Z stp 16384 1 bridge 2024-06-26T07:04:51.9072623Z llc 16384 2 bridge,stp 2024-06-26T07:04:51.9073208Z overlay 86016 0 2024-06-26T07:04:51.9073751Z sunrpc 393216 1 2024-06-26T07:04:51.9074149Z dm_mirror 28672 0 2024-06-26T07:04:51.9074568Z dm_region_hash 20480 1 dm_mirror 2024-06-26T07:04:51.9075081Z dm_log 20480 2 dm_region_hash,dm_mirror 2024-06-26T07:04:51.9075611Z dm_mod 143360 2 dm_log,dm_mirror 2024-06-26T07:04:51.9076087Z dax 69632 1 dm_mod 2024-06-26T07:04:51.9076503Z crc32_pclmul 16384 0 2024-06-26T07:04:51.9076890Z ghash_clmulni_intel 16384 0 2024-06-26T07:04:51.9077285Z pcbc 16384 0 2024-06-26T07:04:51.9077666Z aesni_intel 188416 0 2024-06-26T07:04:51.9078052Z aes_x86_64 20480 1 aesni_intel 2024-06-26T07:04:51.9078500Z crypto_simd 16384 1 aesni_intel 2024-06-26T07:04:51.9078963Z glue_helper 16384 1 aesni_intel 2024-06-26T07:04:51.9079383Z mousedev 24576 0 2024-06-26T07:04:51.9079965Z cryptd 28672 3 crypto_simd,ghash_clmulni_intel,aesni_intel 2024-06-26T07:04:51.9080659Z evdev 20480 3 2024-06-26T07:04:51.9081033Z psmouse 32768 0 2024-06-26T07:04:51.9081401Z button 16384 0 2024-06-26T07:04:51.9081814Z ena 139264 0 2024-06-26T07:04:51.9082180Z ptp 20480 1 ena 2024-06-26T07:04:51.9082574Z pps_core 20480 1 ptp 2024-06-26T07:04:51.9082967Z crc32c_intel 24576 0 2024-06-26T07:04:51.9083333Z autofs4 49152 2 2024-06-26T07:04:51.9083683Z + modinfo nvidia 2024-06-26T07:04:51.9084390Z filename: /lib/modules/4.14.336-257.562.amzn2.x86_64/kernel/drivers/video/nvidia.ko 2024-06-26T07:04:51.9085435Z alias: char-major-195-* 2024-06-26T07:04:51.9085813Z version: 550.54.15 2024-06-26T07:04:51.9086159Z supported: external 2024-06-26T07:04:51.9086502Z license: NVIDIA 2024-06-26T07:04:51.9086871Z firmware: nvidia/550.54.15/gsp_tu10x.bin 2024-06-26T07:04:51.9087355Z firmware: nvidia/550.54.15/gsp_ga10x.bin 2024-06-26T07:04:51.9087974Z srcversion: 833721318DA517F0C2FEC97 2024-06-26T07:04:51.9088431Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2024-06-26T07:04:51.9088929Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2024-06-26T07:04:51.9089417Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2024-06-26T07:04:51.9089968Z depends: i2c-core,drm 2024-06-26T07:04:51.9090331Z retpoline: Y 2024-06-26T07:04:51.9090648Z name: nvidia 2024-06-26T07:04:51.9091192Z vermagic: 4.14.336-257.562.amzn2.x86_64 SMP mod_unload modversions 2024-06-26T07:04:51.9091855Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2024-06-26T07:04:51.9092505Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2024-06-26T07:04:51.9093092Z parm: NVreg_ResmanDebugLevel:int 2024-06-26T07:04:51.9093544Z parm: NVreg_RmLogonRC:int 2024-06-26T07:04:51.9093979Z parm: NVreg_ModifyDeviceFiles:int 2024-06-26T07:04:51.9094435Z parm: NVreg_DeviceFileUID:int 2024-06-26T07:04:51.9094879Z parm: NVreg_DeviceFileGID:int 2024-06-26T07:04:51.9095317Z parm: NVreg_DeviceFileMode:int 2024-06-26T07:04:51.9095839Z parm: NVreg_InitializeSystemMemoryAllocations:int 2024-06-26T07:04:51.9096405Z parm: NVreg_UsePageAttributeTable:int 2024-06-26T07:04:51.9096892Z parm: NVreg_EnablePCIeGen3:int 2024-06-26T07:04:51.9097318Z parm: NVreg_EnableMSI:int 2024-06-26T07:04:51.9097736Z parm: NVreg_TCEBypassMode:int 2024-06-26T07:04:51.9098194Z parm: NVreg_EnableStreamMemOPs:int 2024-06-26T07:04:51.9098718Z parm: NVreg_RestrictProfilingToAdminUsers:int 2024-06-26T07:04:51.9099304Z parm: NVreg_PreserveVideoMemoryAllocations:int 2024-06-26T07:04:51.9099866Z parm: NVreg_EnableS0ixPowerManagement:int 2024-06-26T07:04:51.9100471Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2024-06-26T07:04:51.9101071Z parm: NVreg_DynamicPowerManagement:int 2024-06-26T07:04:51.9101692Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2024-06-26T07:04:51.9102289Z parm: NVreg_EnableGpuFirmware:int 2024-06-26T07:04:51.9102774Z parm: NVreg_EnableGpuFirmwareLogs:int 2024-06-26T07:04:51.9103320Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2024-06-26T07:04:51.9103870Z parm: NVreg_EnableUserNUMAManagement:int 2024-06-26T07:04:51.9104365Z parm: NVreg_MemoryPoolSize:int 2024-06-26T07:04:51.9104832Z parm: NVreg_KMallocHeapMaxSize:int 2024-06-26T07:04:51.9105317Z parm: NVreg_VMallocHeapMaxSize:int 2024-06-26T07:04:51.9105900Z parm: NVreg_IgnoreMMIOCheck:int 2024-06-26T07:04:51.9106349Z parm: NVreg_NvLinkDisable:int 2024-06-26T07:04:51.9106853Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2024-06-26T07:04:51.9107372Z parm: NVreg_RegisterPCIDriver:int 2024-06-26T07:04:51.9107851Z parm: NVreg_EnableResizableBar:int 2024-06-26T07:04:51.9108344Z parm: NVreg_EnableDbgBreakpoint:int 2024-06-26T07:04:51.9108835Z parm: NVreg_EnableNonblockingOpen:int 2024-06-26T07:04:51.9109324Z parm: NVreg_RegistryDwords:charp 2024-06-26T07:04:51.9109821Z parm: NVreg_RegistryDwordsPerDevice:charp 2024-06-26T07:04:51.9110309Z parm: NVreg_RmMsg:charp 2024-06-26T07:04:51.9110721Z parm: NVreg_GpuBlacklist:charp 2024-06-26T07:04:51.9111195Z parm: NVreg_TemporaryFilePath:charp 2024-06-26T07:04:51.9111657Z parm: NVreg_ExcludedGpus:charp 2024-06-26T07:04:51.9112116Z parm: NVreg_DmaRemapPeerMmio:int 2024-06-26T07:04:51.9112598Z parm: NVreg_RmNvlinkBandwidth:charp 2024-06-26T07:04:51.9113067Z parm: NVreg_ImexChannelCount:int 2024-06-26T07:04:51.9113520Z parm: rm_firmware_active:charp 2024-06-26T07:04:51.9113931Z + HAS_NVIDIA_DRIVER=0 2024-06-26T07:04:51.9114309Z ++ command -v nvidia-smi 2024-06-26T07:04:51.9114703Z + '[' -x /usr/bin/nvidia-smi ']' 2024-06-26T07:04:51.9115175Z + set +e 2024-06-26T07:04:51.9115670Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2024-06-26T07:04:53.9078555Z + INSTALLED_DRIVER_VERSION=550.54.15 2024-06-26T07:04:53.9079074Z + NVIDIA_SMI_STATUS=0 2024-06-26T07:04:53.9079667Z + '[' 0 -ne 0 ']' 2024-06-26T07:04:53.9080581Z + '[' 550.54.15 '!=' 550.54.15 ']' 2024-06-26T07:04:53.9081108Z + HAS_NVIDIA_DRIVER=1 2024-06-26T07:04:53.9082036Z + echo 'NVIDIA driver (550.54.15) has already been installed. Skipping NVIDIA driver installation' 2024-06-26T07:04:53.9082764Z + set -e 2024-06-26T07:04:53.9083065Z + '[' 1 -eq 0 ']' 2024-06-26T07:04:53.9083408Z + post_install_nvidia_driver_common 2024-06-26T07:04:53.9084079Z NVIDIA driver (550.54.15) has already been installed. Skipping NVIDIA driver installation 2024-06-26T07:04:53.9085044Z + sudo modprobe nvidia 2024-06-26T07:04:53.9172636Z + echo 'After installing NVIDIA driver' 2024-06-26T07:04:53.9173126Z + lspci 2024-06-26T07:04:53.9173428Z After installing NVIDIA driver 2024-06-26T07:04:53.9254106Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2024-06-26T07:04:53.9255065Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2024-06-26T07:04:53.9256203Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2024-06-26T07:04:53.9257049Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2024-06-26T07:04:53.9257864Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061 2024-06-26T07:04:53.9258695Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2024-06-26T07:04:53.9259495Z 00:1e.0 3D controller: NVIDIA Corporation Device 2237 (rev a1) 2024-06-26T07:04:53.9260326Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2024-06-26T07:04:53.9260930Z + lsmod 2024-06-26T07:04:53.9268305Z Module Size Used by 2024-06-26T07:04:53.9268902Z ib_core 266240 0 2024-06-26T07:04:53.9269402Z nvidia_modeset 1339392 0 2024-06-26T07:04:53.9269933Z veth 16384 0 2024-06-26T07:04:53.9270367Z nvidia_uvm 4599808 0 2024-06-26T07:04:53.9270916Z nvidia 53989376 2 nvidia_uvm,nvidia_modeset 2024-06-26T07:04:53.9271540Z drm 421888 1 nvidia 2024-06-26T07:04:53.9272097Z i2c_core 77824 2 nvidia,drm 2024-06-26T07:04:53.9272565Z backlight 16384 1 nvidia_modeset 2024-06-26T07:04:53.9273011Z xt_conntrack 16384 1 2024-06-26T07:04:53.9273397Z ipt_MASQUERADE 16384 1 2024-06-26T07:04:53.9273827Z nf_nat_masquerade_ipv4 16384 1 ipt_MASQUERADE 2024-06-26T07:04:53.9274296Z nf_conntrack_netlink 49152 0 2024-06-26T07:04:53.9274742Z nfnetlink 16384 2 nf_conntrack_netlink 2024-06-26T07:04:53.9275198Z xfrm_user 45056 1 2024-06-26T07:04:53.9275586Z xfrm_algo 16384 1 xfrm_user 2024-06-26T07:04:53.9276003Z iptable_nat 16384 1 2024-06-26T07:04:53.9276374Z nf_conntrack_ipv4 16384 3 2024-06-26T07:04:53.9276959Z nf_defrag_ipv4 16384 1 nf_conntrack_ipv4 2024-06-26T07:04:53.9277602Z nf_nat_ipv4 16384 1 iptable_nat 2024-06-26T07:04:53.9278293Z nf_nat 32768 2 nf_nat_masquerade_ipv4,nf_nat_ipv4 2024-06-26T07:04:53.9279579Z nf_conntrack 155648 7 xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv4,nf_nat,ipt_MASQUERADE,nf_nat_ipv4,nf_conntrack_netlink 2024-06-26T07:04:53.9280717Z xt_addrtype 16384 2 2024-06-26T07:04:53.9281269Z iptable_filter 16384 1 2024-06-26T07:04:53.9281792Z br_netfilter 24576 0 2024-06-26T07:04:53.9282284Z bridge 172032 1 br_netfilter 2024-06-26T07:04:53.9282884Z stp 16384 1 bridge 2024-06-26T07:04:53.9283468Z llc 16384 2 bridge,stp 2024-06-26T07:04:53.9283926Z overlay 86016 0 2024-06-26T07:04:53.9284296Z sunrpc 393216 1 2024-06-26T07:04:53.9285054Z dm_mirror 28672 0 2024-06-26T07:04:53.9285875Z dm_region_hash 20480 1 dm_mirror 2024-06-26T07:04:53.9286497Z dm_log 20480 2 dm_region_hash,dm_mirror 2024-06-26T07:04:53.9287016Z dm_mod 143360 2 dm_log,dm_mirror 2024-06-26T07:04:53.9287555Z dax 69632 1 dm_mod 2024-06-26T07:04:53.9288057Z crc32_pclmul 16384 0 2024-06-26T07:04:53.9288444Z ghash_clmulni_intel 16384 0 2024-06-26T07:04:53.9288825Z pcbc 16384 0 2024-06-26T07:04:53.9289185Z aesni_intel 188416 0 2024-06-26T07:04:53.9289575Z aes_x86_64 20480 1 aesni_intel 2024-06-26T07:04:53.9290031Z crypto_simd 16384 1 aesni_intel 2024-06-26T07:04:53.9290477Z glue_helper 16384 1 aesni_intel 2024-06-26T07:04:53.9290904Z mousedev 24576 0 2024-06-26T07:04:53.9291408Z cryptd 28672 3 crypto_simd,ghash_clmulni_intel,aesni_intel 2024-06-26T07:04:53.9291946Z evdev 20480 3 2024-06-26T07:04:53.9292316Z psmouse 32768 0 2024-06-26T07:04:53.9292693Z button 16384 0 2024-06-26T07:04:53.9293049Z ena 139264 0 2024-06-26T07:04:53.9293413Z ptp 20480 1 ena 2024-06-26T07:04:53.9293808Z pps_core 20480 1 ptp 2024-06-26T07:04:53.9294202Z crc32c_intel 24576 0 2024-06-26T07:04:53.9294575Z autofs4 49152 2 2024-06-26T07:04:53.9294963Z + modinfo nvidia 2024-06-26T07:04:53.9295624Z filename: /lib/modules/4.14.336-257.562.amzn2.x86_64/kernel/drivers/video/nvidia.ko 2024-06-26T07:04:53.9296328Z alias: char-major-195-* 2024-06-26T07:04:53.9296709Z version: 550.54.15 2024-06-26T07:04:53.9297047Z supported: external 2024-06-26T07:04:53.9297379Z license: NVIDIA 2024-06-26T07:04:53.9297760Z firmware: nvidia/550.54.15/gsp_tu10x.bin 2024-06-26T07:04:53.9298242Z firmware: nvidia/550.54.15/gsp_ga10x.bin 2024-06-26T07:04:53.9298695Z srcversion: 833721318DA517F0C2FEC97 2024-06-26T07:04:53.9299169Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2024-06-26T07:04:53.9299668Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2024-06-26T07:04:53.9300155Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2024-06-26T07:04:53.9300648Z depends: i2c-core,drm 2024-06-26T07:04:53.9301004Z retpoline: Y 2024-06-26T07:04:53.9301308Z name: nvidia 2024-06-26T07:04:53.9301856Z vermagic: 4.14.336-257.562.amzn2.x86_64 SMP mod_unload modversions 2024-06-26T07:04:53.9302522Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2024-06-26T07:04:53.9303168Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2024-06-26T07:04:53.9303772Z parm: NVreg_ResmanDebugLevel:int 2024-06-26T07:04:53.9304221Z parm: NVreg_RmLogonRC:int 2024-06-26T07:04:53.9304648Z parm: NVreg_ModifyDeviceFiles:int 2024-06-26T07:04:53.9305110Z parm: NVreg_DeviceFileUID:int 2024-06-26T07:04:53.9305694Z parm: NVreg_DeviceFileGID:int 2024-06-26T07:04:53.9306133Z parm: NVreg_DeviceFileMode:int 2024-06-26T07:04:53.9306666Z parm: NVreg_InitializeSystemMemoryAllocations:int 2024-06-26T07:04:53.9307232Z parm: NVreg_UsePageAttributeTable:int 2024-06-26T07:04:53.9307708Z parm: NVreg_EnablePCIeGen3:int 2024-06-26T07:04:53.9308147Z parm: NVreg_EnableMSI:int 2024-06-26T07:04:53.9308563Z parm: NVreg_TCEBypassMode:int 2024-06-26T07:04:53.9309012Z parm: NVreg_EnableStreamMemOPs:int 2024-06-26T07:04:53.9309545Z parm: NVreg_RestrictProfilingToAdminUsers:int 2024-06-26T07:04:53.9310133Z parm: NVreg_PreserveVideoMemoryAllocations:int 2024-06-26T07:04:53.9310701Z parm: NVreg_EnableS0ixPowerManagement:int 2024-06-26T07:04:53.9311316Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2024-06-26T07:04:53.9311919Z parm: NVreg_DynamicPowerManagement:int 2024-06-26T07:04:53.9312537Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2024-06-26T07:04:53.9313244Z parm: NVreg_EnableGpuFirmware:int 2024-06-26T07:04:53.9313741Z parm: NVreg_EnableGpuFirmwareLogs:int 2024-06-26T07:04:53.9314279Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2024-06-26T07:04:53.9314828Z parm: NVreg_EnableUserNUMAManagement:int 2024-06-26T07:04:53.9315385Z parm: NVreg_MemoryPoolSize:int 2024-06-26T07:04:53.9315856Z parm: NVreg_KMallocHeapMaxSize:int 2024-06-26T07:04:53.9316338Z parm: NVreg_VMallocHeapMaxSize:int 2024-06-26T07:04:53.9316809Z parm: NVreg_IgnoreMMIOCheck:int 2024-06-26T07:04:53.9317260Z parm: NVreg_NvLinkDisable:int 2024-06-26T07:04:53.9317758Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2024-06-26T07:04:53.9318287Z parm: NVreg_RegisterPCIDriver:int 2024-06-26T07:04:53.9318766Z parm: NVreg_EnableResizableBar:int 2024-06-26T07:04:53.9319248Z parm: NVreg_EnableDbgBreakpoint:int 2024-06-26T07:04:53.9319746Z parm: NVreg_EnableNonblockingOpen:int 2024-06-26T07:04:53.9320241Z parm: NVreg_RegistryDwords:charp 2024-06-26T07:04:53.9320735Z parm: NVreg_RegistryDwordsPerDevice:charp 2024-06-26T07:04:53.9340047Z parm: NVreg_RmMsg:charp 2024-06-26T07:04:53.9340659Z parm: NVreg_GpuBlacklist:charp 2024-06-26T07:04:53.9341158Z parm: NVreg_TemporaryFilePath:charp 2024-06-26T07:04:53.9341635Z parm: NVreg_ExcludedGpus:charp 2024-06-26T07:04:53.9342098Z parm: NVreg_DmaRemapPeerMmio:int 2024-06-26T07:04:53.9342585Z parm: NVreg_RmNvlinkBandwidth:charp 2024-06-26T07:04:53.9343062Z parm: NVreg_ImexChannelCount:int 2024-06-26T07:04:53.9343528Z parm: rm_firmware_active:charp 2024-06-26T07:04:53.9343945Z + set +e 2024-06-26T07:04:53.9344266Z + nvidia-smi 2024-06-26T07:04:55.4922557Z Wed Jun 26 07:04:55 2024 2024-06-26T07:04:55.4923471Z +-----------------------------------------------------------------------------------------+ 2024-06-26T07:04:55.4924350Z | NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 | 2024-06-26T07:04:55.4925400Z |-----------------------------------------+------------------------+----------------------+ 2024-06-26T07:04:55.4926220Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2024-06-26T07:04:55.4927118Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2024-06-26T07:04:55.4927816Z | | | MIG M. | 2024-06-26T07:04:55.4928365Z |=========================================+========================+======================| 2024-06-26T07:04:55.5029805Z | 0 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 | 2024-06-26T07:04:55.5030622Z | 0% 34C P0 61W / 300W | 0MiB / 23028MiB | 0% Default | 2024-06-26T07:04:55.5031246Z | | | N/A | 2024-06-26T07:04:55.5032031Z +-----------------------------------------+------------------------+----------------------+ 2024-06-26T07:04:55.5032615Z 2024-06-26T07:04:55.5033274Z +-----------------------------------------------------------------------------------------+ 2024-06-26T07:04:55.5033897Z | Processes: | 2024-06-26T07:04:55.5034594Z | GPU GI CI PID Type Process name GPU Memory | 2024-06-26T07:04:55.5035287Z | ID ID Usage | 2024-06-26T07:04:55.5035843Z |=========================================================================================| 2024-06-26T07:04:55.5036483Z | No running processes found | 2024-06-26T07:04:55.5037528Z +-----------------------------------------------------------------------------------------+ 2024-06-26T07:04:56.1245249Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2024-06-26T07:04:57.6877974Z NVIDIA A10G 2024-06-26T07:04:58.1300446Z + NVIDIA_SMI_STATUS=0 2024-06-26T07:04:58.1301314Z + '[' 0 -eq 0 ']' 2024-06-26T07:04:58.1301728Z + echo 'INFO: Ignoring allowed status 0' 2024-06-26T07:04:58.1302147Z + set -e 2024-06-26T07:04:58.1302441Z INFO: Ignoring allowed status 0 2024-06-26T07:04:58.1304470Z == Installing nvidia container toolkit for amzn2 == 2024-06-26T07:04:58.1307249Z + sudo yum install -y yum-utils 2024-06-26T07:04:58.4339513Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-06-26T07:05:00.0892948Z Package yum-utils-1.1.31-46.amzn2.0.1.noarch already installed and latest version 2024-06-26T07:05:00.0893669Z Nothing to do 2024-06-26T07:05:00.2385650Z + [[ amzn2 == \a\m\z\n\2\0\2\3 ]] 2024-06-26T07:05:00.2386856Z + YUM_REPO_URL=https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo 2024-06-26T07:05:00.2388303Z + sudo yum-config-manager --add-repo https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo 2024-06-26T07:05:00.5785569Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-06-26T07:05:00.6017775Z adding repo from: https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo 2024-06-26T07:05:00.6019079Z grabbing file https://nvidia.github.io/nvidia-docker/amzn2/nvidia-docker.repo to /etc/yum.repos.d/nvidia-docker.repo 2024-06-26T07:05:00.6020044Z repo saved to /etc/yum.repos.d/nvidia-docker.repo 2024-06-26T07:05:00.6133105Z + sudo yum install -y nvidia-docker2 2024-06-26T07:05:00.9115184Z Loaded plugins: extras_suggestions, langpacks, priorities, update-motd 2024-06-26T07:05:02.3972771Z Package nvidia-docker2-2.13.0-1.noarch already installed and latest version 2024-06-26T07:05:02.3973627Z Nothing to do 2024-06-26T07:05:02.5466507Z + sudo systemctl restart docker 2024-06-26T07:05:40.5673790Z Command completed after 1 attempt(s). 2024-06-26T07:05:40.5742120Z ##[group]Run python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84 2024-06-26T07:05:40.5742933Z python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84 2024-06-26T07:05:40.5743630Z python3 -m tools.stats.monitor > usage_log.txt 2>&1 & 2024-06-26T07:05:40.5744302Z echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" 2024-06-26T07:05:40.5752189Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:05:40.5752689Z env: 2024-06-26T07:05:40.5752976Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:40.5753438Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:40.5753918Z ##[endgroup] 2024-06-26T07:05:40.7953144Z Defaulting to user installation because normal site-packages is not writeable 2024-06-26T07:05:40.8114126Z Requirement already satisfied: psutil==5.9.1 in /home/ec2-user/.local/lib/python3.7/site-packages (5.9.1) 2024-06-26T07:05:40.8213460Z Requirement already satisfied: nvidia-ml-py==11.525.84 in /home/ec2-user/.local/lib/python3.7/site-packages (11.525.84) 2024-06-26T07:05:40.9153717Z Prepare all required actions 2024-06-26T07:05:40.9154214Z Getting action download info 2024-06-26T07:05:41.0246191Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2024-06-26T07:05:41.2148779Z Download action repository 'actions/download-artifact@v3' (SHA:9bc31d5ccc31df68ecc42ccf4149144866c47d8a) 2024-06-26T07:05:41.3352685Z ##[group]Run ./.github/actions/download-build-artifacts 2024-06-26T07:05:41.3353189Z with: 2024-06-26T07:05:41.3353548Z name: linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T07:05:41.3354036Z s3-bucket: gha-artifacts 2024-06-26T07:05:41.3354371Z env: 2024-06-26T07:05:41.3354638Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:41.3355095Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:41.3355583Z ##[endgroup] 2024-06-26T07:05:41.3390015Z ##[group]Run seemethere/download-artifact-s3@v4 2024-06-26T07:05:41.3390617Z with: 2024-06-26T07:05:41.3390988Z name: linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T07:05:41.3391455Z s3-bucket: gha-artifacts 2024-06-26T07:05:41.3391811Z region: us-east-1 2024-06-26T07:05:41.3392106Z env: 2024-06-26T07:05:41.3392378Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:41.3392832Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:41.3393320Z ##[endgroup] 2024-06-26T07:05:41.7610061Z (node:17955) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2024-06-26T07:05:41.7611566Z 2024-06-26T07:05:41.7612114Z Please migrate your code to use AWS SDK for JavaScript (v3). 2024-06-26T07:05:41.7613664Z For more information, check the migration guide at https://a.co/7PzMCcy 2024-06-26T07:05:41.7614660Z (Use `node --trace-warnings ...` to show where the warning was created) 2024-06-26T07:05:41.8250675Z Found 1 objects with prefix pytorch/pytorch/9673645538/linux-focal-cuda12.1-py3.10-gcc9-sm86/ 2024-06-26T07:05:41.8251778Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2024-06-26T07:05:49.3147420Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2024-06-26T07:05:49.3153145Z Artifact download has finished successfully 2024-06-26T07:05:49.3330907Z ##[group]Run unzip -o artifacts.zip 2024-06-26T07:05:49.3331361Z unzip -o artifacts.zip 2024-06-26T07:05:49.3338978Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:05:49.3339485Z env: 2024-06-26T07:05:49.3339766Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:49.3340217Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:49.3340701Z ##[endgroup] 2024-06-26T07:05:49.3368113Z Archive: artifacts.zip 2024-06-26T07:05:49.3369834Z creating: dist/ 2024-06-26T07:05:51.3198683Z inflating: dist/torch-2.5.0a0+gitb8c4c54-cp310-cp310-linux_x86_64.whl 2024-06-26T07:05:51.3200031Z creating: build/custom_test_artifacts/ 2024-06-26T07:05:51.3201213Z creating: build/custom_test_artifacts/custom-op-build/ 2024-06-26T07:05:51.3202630Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2024-06-26T07:05:51.3204266Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2024-06-26T07:05:51.3206509Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2024-06-26T07:05:51.3207533Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/ 2024-06-26T07:05:51.3208483Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-06-26T07:05:51.3209611Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-06-26T07:05:51.3210835Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-06-26T07:05:51.3211963Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-06-26T07:05:51.3213401Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-06-26T07:05:51.3214461Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-06-26T07:05:51.3215685Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-06-26T07:05:51.3217007Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-06-26T07:05:51.3218278Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-06-26T07:05:51.3219679Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-06-26T07:05:51.3221017Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-06-26T07:05:51.3222199Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-06-26T07:05:51.3223554Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-06-26T07:05:51.3224619Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-06-26T07:05:51.3225781Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-06-26T07:05:51.3260282Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-06-26T07:05:51.3298240Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-06-26T07:05:51.3299760Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-06-26T07:05:51.3343130Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-06-26T07:05:51.3344921Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-06-26T07:05:51.3346851Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-06-26T07:05:51.3348595Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-06-26T07:05:51.3350345Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-06-26T07:05:51.3352093Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-06-26T07:05:51.3353704Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-06-26T07:05:51.3355181Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-06-26T07:05:51.3356648Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-06-26T07:05:51.3357965Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-06-26T07:05:51.3359225Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-06-26T07:05:51.3360463Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-06-26T07:05:51.3361727Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-06-26T07:05:51.3362942Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-06-26T07:05:51.3364181Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-06-26T07:05:51.3413118Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-06-26T07:05:51.3471147Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-06-26T07:05:51.3472668Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-06-26T07:05:51.3473971Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2024-06-26T07:05:51.3475046Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2024-06-26T07:05:51.3476139Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2024-06-26T07:05:51.3477291Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2024-06-26T07:05:51.3478534Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2024-06-26T07:05:51.3479867Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2024-06-26T07:05:51.3480977Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2024-06-26T07:05:51.3482021Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2024-06-26T07:05:51.3483104Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2024-06-26T07:05:51.3484189Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2024-06-26T07:05:51.3485545Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2024-06-26T07:05:51.3486698Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2024-06-26T07:05:51.3487765Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2024-06-26T07:05:51.3498696Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2024-06-26T07:05:51.3626509Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2024-06-26T07:05:51.3627833Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2024-06-26T07:05:51.3629140Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2024-06-26T07:05:51.3630657Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2024-06-26T07:05:51.3631998Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2024-06-26T07:05:51.3633137Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2024-06-26T07:05:51.3634296Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2024-06-26T07:05:51.3635468Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2024-06-26T07:05:51.3636640Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2024-06-26T07:05:51.3637815Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2024-06-26T07:05:51.3638953Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 2024-06-26T07:05:51.3652068Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2024-06-26T07:05:51.3731890Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2024-06-26T07:05:51.3733354Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-06-26T07:05:51.3734688Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2024-06-26T07:05:51.3735883Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2024-06-26T07:05:51.3736825Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2024-06-26T07:05:51.3737943Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2024-06-26T07:05:51.3739045Z inflating: build/custom_test_artifacts/custom-op-build/detect_cuda_version.cc 2024-06-26T07:05:51.3739954Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2024-06-26T07:05:51.3740735Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2024-06-26T07:05:51.3741534Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2024-06-26T07:05:51.3847815Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2024-06-26T07:05:51.3908502Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2024-06-26T07:05:51.3909425Z creating: build/custom_test_artifacts/jit-hook-build/ 2024-06-26T07:05:51.3910131Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2024-06-26T07:05:51.3910955Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2024-06-26T07:05:51.3916242Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2024-06-26T07:05:51.3917313Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/ 2024-06-26T07:05:51.3918239Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-06-26T07:05:51.3919448Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-06-26T07:05:51.3920612Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-06-26T07:05:51.3921834Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-06-26T07:05:51.3923048Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-06-26T07:05:51.3924244Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-06-26T07:05:51.3925606Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-06-26T07:05:51.3926991Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-06-26T07:05:51.3928297Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-06-26T07:05:51.3929655Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-06-26T07:05:51.3931024Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-06-26T07:05:51.3948672Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-06-26T07:05:51.3950004Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-06-26T07:05:51.3951080Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-06-26T07:05:51.3952114Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-06-26T07:05:51.3969953Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-06-26T07:05:51.4007869Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-06-26T07:05:51.4009363Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-06-26T07:05:51.4053759Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-06-26T07:05:51.4055528Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-06-26T07:05:51.4057456Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-06-26T07:05:51.4059182Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-06-26T07:05:51.4060931Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-06-26T07:05:51.4062603Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-06-26T07:05:51.4064183Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-06-26T07:05:51.4065776Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-06-26T07:05:51.4067381Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-06-26T07:05:51.4068715Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-06-26T07:05:51.4069993Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-06-26T07:05:51.4071224Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-06-26T07:05:51.4072466Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-06-26T07:05:51.4073668Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-06-26T07:05:51.4074883Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-06-26T07:05:51.4123510Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-06-26T07:05:51.4182536Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-06-26T07:05:51.4184076Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-06-26T07:05:51.4185469Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2024-06-26T07:05:51.4186569Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2024-06-26T07:05:51.4187669Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2024-06-26T07:05:51.4188889Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2024-06-26T07:05:51.4190032Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2024-06-26T07:05:51.4191273Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2024-06-26T07:05:51.4192487Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2024-06-26T07:05:51.4193601Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2024-06-26T07:05:51.4194760Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2024-06-26T07:05:51.4195919Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2024-06-26T07:05:51.4197126Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2024-06-26T07:05:51.4198279Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2024-06-26T07:05:51.4199417Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2024-06-26T07:05:51.4209657Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2024-06-26T07:05:51.4271231Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2024-06-26T07:05:51.4272606Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-06-26T07:05:51.4273917Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2024-06-26T07:05:51.4275007Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2024-06-26T07:05:51.4276047Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2024-06-26T07:05:51.4277149Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2024-06-26T07:05:51.4278205Z inflating: build/custom_test_artifacts/jit-hook-build/detect_cuda_version.cc 2024-06-26T07:05:51.4279054Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2024-06-26T07:05:51.4279956Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2024-06-26T07:05:51.4280785Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2024-06-26T07:05:51.4325737Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2024-06-26T07:05:51.4326543Z creating: build/custom_test_artifacts/custom-backend-build/ 2024-06-26T07:05:51.4327312Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 2024-06-26T07:05:51.4328201Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2024-06-26T07:05:51.4333575Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2024-06-26T07:05:51.4334631Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/ 2024-06-26T07:05:51.4335745Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeSystem.cmake 2024-06-26T07:05:51.4336994Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/ 2024-06-26T07:05:51.4338254Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/tmp/ 2024-06-26T07:05:51.4339493Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/CMakeCCompilerId.c 2024-06-26T07:05:51.4340921Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdC/a.out 2024-06-26T07:05:51.4342124Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/ 2024-06-26T07:05:51.4343359Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/tmp/ 2024-06-26T07:05:51.4344727Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/CMakeCXXCompilerId.cpp 2024-06-26T07:05:51.4346498Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCXX/a.out 2024-06-26T07:05:51.4347767Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_C.bin 2024-06-26T07:05:51.4349035Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCCompiler.cmake 2024-06-26T07:05:51.4350316Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CXX.bin 2024-06-26T07:05:51.4351595Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCXXCompiler.cmake 2024-06-26T07:05:51.4352730Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/ 2024-06-26T07:05:51.4353838Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/ 2024-06-26T07:05:51.4385666Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2024-06-26T07:05:51.4423447Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2024-06-26T07:05:51.4425134Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2024-06-26T07:05:51.4467918Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2024-06-26T07:05:51.4469802Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2024-06-26T07:05:51.4471658Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2024-06-26T07:05:51.4473557Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2024-06-26T07:05:51.4475188Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2024-06-26T07:05:51.4476805Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2024-06-26T07:05:51.4478302Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2024-06-26T07:05:51.4479766Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2024-06-26T07:05:51.4481197Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2024-06-26T07:05:51.4482565Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2024-06-26T07:05:51.4483886Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.reg.c 2024-06-26T07:05:51.4485439Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin 2024-06-26T07:05:51.4486768Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2024-06-26T07:05:51.4488058Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/tmp/a_dlink.o 2024-06-26T07:05:51.4489366Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/CMakeCUDACompilerId.cu 2024-06-26T07:05:51.4533510Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CompilerIdCUDA/a.out 2024-06-26T07:05:51.4591449Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeDetermineCompilerABI_CUDA.bin 2024-06-26T07:05:51.4593078Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.26.4/CMakeCUDACompiler.cmake 2024-06-26T07:05:51.4594409Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2024-06-26T07:05:51.4595613Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2024-06-26T07:05:51.4596780Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2024-06-26T07:05:51.4597812Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 2024-06-26T07:05:51.4598973Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2024-06-26T07:05:51.4600450Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2024-06-26T07:05:51.4601748Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2024-06-26T07:05:51.4602936Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2024-06-26T07:05:51.4604149Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2024-06-26T07:05:51.4605829Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2024-06-26T07:05:51.4607079Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2024-06-26T07:05:51.4608290Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2024-06-26T07:05:51.4609499Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2024-06-26T07:05:51.4610790Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2024-06-26T07:05:51.4721699Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2024-06-26T07:05:51.4723147Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2024-06-26T07:05:51.4724891Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2024-06-26T07:05:51.4727367Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2024-06-26T07:05:51.4728696Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2024-06-26T07:05:51.4729948Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2024-06-26T07:05:51.4731244Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2024-06-26T07:05:51.4732544Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2024-06-26T07:05:51.4733814Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2024-06-26T07:05:51.4735068Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2024-06-26T07:05:51.4736338Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2024-06-26T07:05:51.4745418Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2024-06-26T07:05:51.4798207Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2024-06-26T07:05:51.4799528Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2024-06-26T07:05:51.4800686Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2024-06-26T07:05:51.4801739Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2024-06-26T07:05:51.4802710Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2024-06-26T07:05:51.4803700Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2024-06-26T07:05:51.4804917Z inflating: build/custom_test_artifacts/custom-backend-build/detect_cuda_version.cc 2024-06-26T07:05:51.4805846Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2024-06-26T07:05:51.4806684Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2024-06-26T07:05:51.4807563Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2024-06-26T07:05:51.4904315Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2024-06-26T07:05:51.4944743Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2024-06-26T07:05:51.4945429Z creating: build/lib/ 2024-06-26T07:05:51.5030247Z inflating: build/lib/libprotobuf-lite.a 2024-06-26T07:05:51.5468192Z inflating: build/lib/libprotobuf.a 2024-06-26T07:05:51.5477536Z inflating: build/lib/libpthreadpool.a 2024-06-26T07:05:51.5485289Z inflating: build/lib/libcpuinfo.a 2024-06-26T07:05:51.5493064Z inflating: build/lib/libcpuinfo_internals.a 2024-06-26T07:05:51.5493906Z inflating: build/lib/libclog.a 2024-06-26T07:05:51.5511008Z inflating: build/lib/libnnpack.a 2024-06-26T07:05:51.5513108Z inflating: build/lib/libnnpack_reference_layers.a 2024-06-26T07:05:51.5575947Z inflating: build/lib/libgtest.a 2024-06-26T07:05:51.5648850Z inflating: build/lib/libbenchmark.a 2024-06-26T07:05:51.5709867Z inflating: build/lib/libasmjit.a 2024-06-26T07:05:51.5716950Z inflating: build/lib/libittnotify.a 2024-06-26T07:05:51.5743563Z inflating: build/lib/libtensorpipe_uv.a 2024-06-26T07:05:51.5865995Z inflating: build/lib/libgloo.a 2024-06-26T07:05:51.5887326Z inflating: build/lib/libfmt.a 2024-06-26T07:05:51.5975699Z inflating: build/lib/libc10.so 2024-06-26T07:05:51.5977165Z inflating: build/lib/libcaffe2_nvrtc.so 2024-06-26T07:05:51.5977861Z inflating: build/lib/libfoxi_loader.a 2024-06-26T07:05:51.5978757Z inflating: build/lib/libtorch_global_deps.so 2024-06-26T07:05:51.5997723Z inflating: build/lib/libpytorch_qnnpack.a 2024-06-26T07:05:51.6015009Z inflating: build/lib/libgmock.a 2024-06-26T07:05:51.6015447Z inflating: build/lib/libgtest_main.a 2024-06-26T07:05:51.6016103Z inflating: build/lib/libbenchmark_main.a 2024-06-26T07:05:51.6509476Z inflating: build/lib/libprotoc.a 2024-06-26T07:05:51.6880340Z inflating: build/lib/libgloo_cuda.a 2024-06-26T07:05:52.6616751Z inflating: build/lib/libdnnl.a 2024-06-26T07:05:52.7166833Z inflating: build/lib/libtensorpipe.a 2024-06-26T07:05:52.7219084Z inflating: build/lib/libc10_cuda.so 2024-06-26T07:05:52.7219544Z inflating: build/lib/libgmock_main.a 2024-06-26T07:05:52.8423255Z inflating: build/lib/libfbgemm.a 2024-06-26T07:05:52.8669185Z inflating: build/lib/libtensorpipe_cuda.a 2024-06-26T07:05:52.9167072Z inflating: build/lib/libkineto.a 2024-06-26T07:05:52.9344695Z inflating: build/lib/libXNNPACK.a 2024-06-26T07:05:52.9385313Z inflating: build/lib/libonnx_proto.a 2024-06-26T07:05:53.0051542Z inflating: build/lib/libonnx.a 2024-06-26T07:05:55.3347689Z inflating: build/lib/libtorch_cpu.so 2024-06-26T07:05:55.3351829Z inflating: build/lib/libunbox_lib.a 2024-06-26T07:05:55.3355655Z inflating: build/lib/libshm.so 2024-06-26T07:05:57.3230341Z inflating: build/lib/libtorch_cuda.so 2024-06-26T07:05:57.3231044Z inflating: build/lib/libtorch.so 2024-06-26T07:05:57.3233339Z inflating: build/lib/libc10d_cuda_test.so 2024-06-26T07:05:58.1183462Z inflating: build/lib/libtorch_cuda_linalg.so 2024-06-26T07:05:58.3077672Z inflating: build/lib/libtorch_python.so 2024-06-26T07:05:58.3144294Z inflating: build/lib/libtorchbind_test.so 2024-06-26T07:05:58.3164176Z inflating: build/lib/libjitbackend_test.so 2024-06-26T07:05:58.3188662Z inflating: build/lib/libbackend_with_compiler.so 2024-06-26T07:05:58.3221947Z inflating: build/lib/libnnapi_backend.so 2024-06-26T07:05:58.3222416Z creating: build/bin/ 2024-06-26T07:05:58.3270947Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2024-06-26T07:05:58.3320631Z inflating: build/bin/c10_DeviceGuard_test 2024-06-26T07:05:58.3369803Z inflating: build/bin/c10_Device_test 2024-06-26T07:05:58.3427160Z inflating: build/bin/c10_DispatchKeySet_test 2024-06-26T07:05:58.3478825Z inflating: build/bin/c10_Scalar_test 2024-06-26T07:05:58.3526636Z inflating: build/bin/c10_StreamGuard_test 2024-06-26T07:05:58.3576687Z inflating: build/bin/c10_SymInt_test 2024-06-26T07:05:58.3629664Z inflating: build/bin/c10_InlineDeviceGuard_test 2024-06-26T07:05:58.3683477Z inflating: build/bin/c10_InlineStreamGuard_test 2024-06-26T07:05:58.3738922Z inflating: build/bin/c10_SizesAndStrides_test 2024-06-26T07:05:58.3807551Z inflating: build/bin/c10_cow_test 2024-06-26T07:05:58.3858960Z inflating: build/bin/c10_Bitset_test 2024-06-26T07:05:58.3906353Z inflating: build/bin/c10_ConstexprCrc_test 2024-06-26T07:05:58.3955150Z inflating: build/bin/c10_DeadlockDetection_test 2024-06-26T07:05:58.4003866Z inflating: build/bin/c10_Half_test 2024-06-26T07:05:58.4058993Z inflating: build/bin/c10_LeftRight_test 2024-06-26T07:05:58.4112397Z inflating: build/bin/c10_Metaprogramming_test 2024-06-26T07:05:58.4161497Z inflating: build/bin/c10_Synchronized_test 2024-06-26T07:05:58.4215451Z inflating: build/bin/c10_ThreadLocal_test 2024-06-26T07:05:58.4265164Z inflating: build/bin/c10_TypeIndex_test 2024-06-26T07:05:58.4314624Z inflating: build/bin/c10_TypeList_test 2024-06-26T07:05:58.4362036Z inflating: build/bin/c10_TypeTraits_test 2024-06-26T07:05:58.4412632Z inflating: build/bin/c10_accumulate_test 2024-06-26T07:05:58.4465660Z inflating: build/bin/c10_bfloat16_test 2024-06-26T07:05:58.4515053Z inflating: build/bin/c10_bit_cast_test 2024-06-26T07:05:58.4569095Z inflating: build/bin/c10_complex_math_test 2024-06-26T07:05:58.4620215Z inflating: build/bin/c10_exception_test 2024-06-26T07:05:58.4674034Z inflating: build/bin/c10_complex_test 2024-06-26T07:05:58.4722638Z inflating: build/bin/c10_flags_test 2024-06-26T07:05:58.4771098Z inflating: build/bin/c10_generic_math_test 2024-06-26T07:05:58.4820103Z inflating: build/bin/c10_irange_test 2024-06-26T07:05:58.4978543Z inflating: build/bin/c10_intrusive_ptr_test 2024-06-26T07:05:58.5030829Z inflating: build/bin/c10_lazy_test 2024-06-26T07:05:58.5085519Z inflating: build/bin/c10_logging_test 2024-06-26T07:05:58.5158048Z inflating: build/bin/c10_optional_test 2024-06-26T07:05:58.5218592Z inflating: build/bin/c10_ordered_preserving_dict_test 2024-06-26T07:05:58.5271235Z inflating: build/bin/c10_registry_test 2024-06-26T07:05:58.5418144Z inflating: build/bin/c10_small_vector_test 2024-06-26T07:05:58.5468320Z inflating: build/bin/c10_ssize_test 2024-06-26T07:05:58.5518576Z inflating: build/bin/c10_string_util_test 2024-06-26T07:05:58.5575234Z inflating: build/bin/c10_string_view_test 2024-06-26T07:05:58.5623971Z inflating: build/bin/c10_tempfile_test 2024-06-26T07:05:58.5678601Z inflating: build/bin/c10_typeid_test 2024-06-26T07:05:58.5725754Z inflating: build/bin/c10_intrusive_ptr_benchmark 2024-06-26T07:05:58.6159123Z inflating: build/bin/protoc-3.13.0.0 2024-06-26T07:05:58.6591311Z inflating: build/bin/protoc 2024-06-26T07:05:58.6642443Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_1_var_test 2024-06-26T07:05:58.6693636Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_stream 2024-06-26T07:05:58.6744581Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2024-06-26T07:05:58.6794811Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_from_2_processes 2024-06-26T07:05:58.6845950Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2024-06-26T07:05:58.6897629Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2024-06-26T07:05:58.6944380Z inflating: build/bin/c10_cuda_CUDATest 2024-06-26T07:05:58.6995462Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2024-06-26T07:05:58.7313462Z inflating: build/bin/vec_test_all_types_DEFAULT 2024-06-26T07:05:58.7647852Z inflating: build/bin/vec_test_all_types_AVX512 2024-06-26T07:05:58.7992927Z inflating: build/bin/vec_test_all_types_AVX2 2024-06-26T07:05:58.8044228Z inflating: build/bin/FileStoreTest 2024-06-26T07:05:58.8098034Z inflating: build/bin/TCPStoreTest 2024-06-26T07:05:58.8149618Z inflating: build/bin/HashStoreTest 2024-06-26T07:05:58.8162434Z inflating: build/bin/ProcessGroupMPITest 2024-06-26T07:05:58.8215250Z inflating: build/bin/test_edge_op_registration 2024-06-26T07:05:58.8219236Z inflating: build/bin/torch_shm_manager 2024-06-26T07:05:58.8221826Z inflating: build/bin/example_allreduce 2024-06-26T07:05:58.8275603Z inflating: build/bin/test_dist_autograd 2024-06-26T07:05:58.8342499Z inflating: build/bin/test_cpp_rpc 2024-06-26T07:05:58.8344653Z inflating: build/bin/parallel_benchmark 2024-06-26T07:05:58.8410170Z inflating: build/bin/test_mobile_nnc 2024-06-26T07:05:58.8418798Z inflating: build/bin/aot_model_compiler_test 2024-06-26T07:05:58.8752601Z inflating: build/bin/test_lazy 2024-06-26T07:05:58.9894926Z inflating: build/bin/test_api 2024-06-26T07:05:58.9966197Z inflating: build/bin/Dict_test 2024-06-26T07:05:59.0016909Z inflating: build/bin/Dimname_test 2024-06-26T07:05:59.0078864Z inflating: build/bin/MaybeOwned_test 2024-06-26T07:05:59.0134745Z inflating: build/bin/NamedTensor_test 2024-06-26T07:05:59.0192390Z inflating: build/bin/apply_utils_test 2024-06-26T07:05:59.0249083Z inflating: build/bin/atest 2024-06-26T07:05:59.0310570Z inflating: build/bin/basic 2024-06-26T07:05:59.0363109Z inflating: build/bin/broadcast_test 2024-06-26T07:05:59.0412319Z inflating: build/bin/cpu_allocator_test 2024-06-26T07:05:59.0468717Z inflating: build/bin/cpu_generator_test 2024-06-26T07:05:59.0520193Z inflating: build/bin/cpu_profiling_allocator_test 2024-06-26T07:05:59.0610956Z inflating: build/bin/cpu_rng_test 2024-06-26T07:05:59.0658986Z inflating: build/bin/dispatch_key_set_test 2024-06-26T07:05:59.0707698Z inflating: build/bin/dlconvertor_test 2024-06-26T07:05:59.0763881Z inflating: build/bin/extension_backend_test 2024-06-26T07:05:59.0817531Z inflating: build/bin/half_test 2024-06-26T07:05:59.0910911Z inflating: build/bin/ivalue_test 2024-06-26T07:05:59.0958490Z inflating: build/bin/lazy_tensor_test 2024-06-26T07:05:59.1011159Z inflating: build/bin/math_kernel_test 2024-06-26T07:05:59.1063913Z inflating: build/bin/memory_format_test 2024-06-26T07:05:59.1115954Z inflating: build/bin/memory_overlapping_test 2024-06-26T07:05:59.1166894Z inflating: build/bin/mobile_memory_cleanup 2024-06-26T07:05:59.1220984Z inflating: build/bin/native_test 2024-06-26T07:05:59.1271415Z inflating: build/bin/operator_name_test 2024-06-26T07:05:59.1320884Z inflating: build/bin/operators_test 2024-06-26T07:05:59.1371199Z inflating: build/bin/packedtensoraccessor_test 2024-06-26T07:05:59.1436585Z inflating: build/bin/pow_test 2024-06-26T07:05:59.1492430Z inflating: build/bin/quantized_test 2024-06-26T07:05:59.1541154Z inflating: build/bin/reduce_ops_test 2024-06-26T07:05:59.1590846Z inflating: build/bin/reportMemoryUsage_test 2024-06-26T07:05:59.1645730Z inflating: build/bin/scalar_tensor_test 2024-06-26T07:05:59.1702632Z inflating: build/bin/scalar_test 2024-06-26T07:05:59.1752748Z inflating: build/bin/StorageUtils_test 2024-06-26T07:05:59.1803631Z inflating: build/bin/stride_properties_test 2024-06-26T07:05:59.1879922Z inflating: build/bin/tensor_iterator_test 2024-06-26T07:05:59.1932593Z inflating: build/bin/test_parallel 2024-06-26T07:05:59.1935386Z inflating: build/bin/thread_init_test 2024-06-26T07:05:59.1989488Z inflating: build/bin/type_ptr_test 2024-06-26T07:05:59.2047644Z inflating: build/bin/type_test 2024-06-26T07:05:59.2098115Z inflating: build/bin/undefined_tensor_test 2024-06-26T07:05:59.2099496Z inflating: build/bin/verify_api_visibility 2024-06-26T07:05:59.2166677Z inflating: build/bin/legacy_vmap_test 2024-06-26T07:05:59.2217376Z inflating: build/bin/weakref_test 2024-06-26T07:05:59.2267069Z inflating: build/bin/wrapdim_test 2024-06-26T07:05:59.2317090Z inflating: build/bin/xla_tensor_test 2024-06-26T07:05:59.2375330Z inflating: build/bin/IListRef_test 2024-06-26T07:05:59.2478088Z inflating: build/bin/List_test 2024-06-26T07:05:59.2542896Z inflating: build/bin/KernelFunction_test 2024-06-26T07:05:59.2660950Z inflating: build/bin/kernel_function_legacy_test 2024-06-26T07:05:59.2754700Z inflating: build/bin/kernel_function_test 2024-06-26T07:05:59.2878616Z inflating: build/bin/kernel_lambda_legacy_test 2024-06-26T07:05:59.2979780Z inflating: build/bin/kernel_lambda_test 2024-06-26T07:05:59.3038550Z inflating: build/bin/kernel_stackbased_test 2024-06-26T07:05:59.3088406Z inflating: build/bin/CppSignature_test 2024-06-26T07:05:59.3182420Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2024-06-26T07:05:59.3229875Z inflating: build/bin/op_allowlist_test 2024-06-26T07:05:59.3283809Z inflating: build/bin/backend_fallback_test 2024-06-26T07:05:59.3576399Z inflating: build/bin/op_registration_test 2024-06-26T07:05:59.3638511Z inflating: build/bin/inline_container_test 2024-06-26T07:05:59.3689621Z inflating: build/bin/cuda_apply_test 2024-06-26T07:05:59.3738895Z inflating: build/bin/cuda_allocator_test 2024-06-26T07:05:59.3797059Z inflating: build/bin/cuda_atomic_ops_test 2024-06-26T07:05:59.3850131Z inflating: build/bin/cuda_caching_host_allocator_test 2024-06-26T07:05:59.3917415Z inflating: build/bin/cuda_complex_math_test 2024-06-26T07:05:59.3974826Z inflating: build/bin/cuda_complex_test 2024-06-26T07:05:59.4023183Z inflating: build/bin/cuda_device_test 2024-06-26T07:05:59.4078793Z inflating: build/bin/cuda_cub_test 2024-06-26T07:05:59.4128082Z inflating: build/bin/cuda_dlconvertor_test 2024-06-26T07:05:59.4191034Z inflating: build/bin/cuda_distributions_test 2024-06-26T07:05:59.4245983Z inflating: build/bin/cuda_generator_test 2024-06-26T07:05:59.4294338Z inflating: build/bin/cuda_half_test 2024-06-26T07:05:59.4343942Z inflating: build/bin/cuda_integer_divider_test 2024-06-26T07:05:59.4391525Z inflating: build/bin/cuda_optional_test 2024-06-26T07:05:59.4441633Z inflating: build/bin/cuda_packedtensoraccessor_test 2024-06-26T07:05:59.4492640Z inflating: build/bin/cuda_reportMemoryUsage_test 2024-06-26T07:05:59.4541173Z inflating: build/bin/cuda_allocatorTraceTracker_test 2024-06-26T07:05:59.4600053Z inflating: build/bin/cuda_stream_test 2024-06-26T07:05:59.4647942Z inflating: build/bin/cuda_cudnn_test 2024-06-26T07:05:59.4699210Z inflating: build/bin/cuda_vectorized_test 2024-06-26T07:05:59.4713678Z inflating: build/bin/tutorial_tensorexpr 2024-06-26T07:05:59.4777455Z inflating: build/bin/ProcessGroupGlooTest 2024-06-26T07:05:59.4833273Z inflating: build/bin/ProcessGroupGlooAsyncTest 2024-06-26T07:05:59.4895812Z inflating: build/bin/ProcessGroupNCCLTest 2024-06-26T07:05:59.4956225Z inflating: build/bin/ProcessGroupNCCLErrorsTest 2024-06-26T07:05:59.5763324Z inflating: build/bin/test_tensorexpr 2024-06-26T07:05:59.6324516Z inflating: build/bin/test_jit 2024-06-26T07:05:59.6325291Z creating: .additional_ci_files/ 2024-06-26T07:05:59.6378458Z inflating: .additional_ci_files/test-times.json 2024-06-26T07:05:59.6589631Z inflating: .additional_ci_files/test-class-times.json 2024-06-26T07:05:59.6619112Z ##[group]Run rm artifacts.zip 2024-06-26T07:05:59.6619514Z rm artifacts.zip 2024-06-26T07:05:59.6628374Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:05:59.6628956Z env: 2024-06-26T07:05:59.6629267Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:59.6629730Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:59.6630234Z ##[endgroup] 2024-06-26T07:05:59.7357693Z ##[group]Run df -H 2024-06-26T07:05:59.7358017Z df -H 2024-06-26T07:05:59.7366021Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:05:59.7366513Z env: 2024-06-26T07:05:59.7366793Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:59.7367247Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:59.7367722Z ##[endgroup] 2024-06-26T07:05:59.7393948Z Filesystem Size Used Avail Use% Mounted on 2024-06-26T07:05:59.7394643Z devtmpfs 34G 0 34G 0% /dev 2024-06-26T07:05:59.7395138Z tmpfs 34G 82k 34G 1% /dev/shm 2024-06-26T07:05:59.7395864Z tmpfs 34G 455k 34G 1% /run 2024-06-26T07:05:59.7396348Z tmpfs 34G 0 34G 0% /sys/fs/cgroup 2024-06-26T07:05:59.7396839Z /dev/nvme0n1p1 162G 36G 126G 23% / 2024-06-26T07:05:59.7435689Z Prepare all required actions 2024-06-26T07:05:59.7436126Z Getting action download info 2024-06-26T07:05:59.8486762Z ##[group]Run ./.github/actions/download-td-artifacts 2024-06-26T07:05:59.8487232Z with: 2024-06-26T07:05:59.8487491Z env: 2024-06-26T07:05:59.8487770Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:59.8488218Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:59.8488708Z ##[endgroup] 2024-06-26T07:05:59.8521802Z ##[group]Run seemethere/download-artifact-s3@v4 2024-06-26T07:05:59.8522247Z with: 2024-06-26T07:05:59.8522507Z name: td_results 2024-06-26T07:05:59.8522826Z s3-bucket: gha-artifacts 2024-06-26T07:05:59.8523185Z region: us-east-1 2024-06-26T07:05:59.8523477Z env: 2024-06-26T07:05:59.8523759Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:05:59.8524223Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:05:59.8524869Z ##[endgroup] 2024-06-26T07:06:00.2709382Z (node:17976) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2024-06-26T07:06:00.2710858Z 2024-06-26T07:06:00.2711580Z Please migrate your code to use AWS SDK for JavaScript (v3). 2024-06-26T07:06:00.2713582Z For more information, check the migration guide at https://a.co/7PzMCcy 2024-06-26T07:06:00.2732374Z (Use `node --trace-warnings ...` to show where the warning was created) 2024-06-26T07:06:00.3461381Z Found 1 objects with prefix pytorch/pytorch/9673645538/td_results/ 2024-06-26T07:06:00.3462434Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2024-06-26T07:06:00.4143939Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2024-06-26T07:06:00.4149865Z Artifact download has finished successfully 2024-06-26T07:06:00.4293618Z ##[group]Run mkdir -p .additional_ci_files 2024-06-26T07:06:00.4294099Z mkdir -p .additional_ci_files 2024-06-26T07:06:00.4294690Z mv td_results.json .additional_ci_files/td_results.json 2024-06-26T07:06:00.4302678Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:00.4303178Z env: 2024-06-26T07:06:00.4303455Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:00.4303902Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:00.4304387Z ##[endgroup] 2024-06-26T07:06:00.4370525Z ##[group]Run .github/scripts/parse_ref.py 2024-06-26T07:06:00.4371008Z .github/scripts/parse_ref.py 2024-06-26T07:06:00.4378539Z shell: /usr/bin/bash -e {0} 2024-06-26T07:06:00.4378888Z env: 2024-06-26T07:06:00.4379180Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:00.4379640Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:00.4380120Z ##[endgroup] 2024-06-26T07:06:00.4610742Z Prepare all required actions 2024-06-26T07:06:00.4650801Z ##[group]Run ./.github/actions/get-workflow-job-id 2024-06-26T07:06:00.4651260Z with: 2024-06-26T07:06:00.4651828Z github-token: *** 2024-06-26T07:06:00.4652168Z env: 2024-06-26T07:06:00.4652536Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:00.4652997Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:00.4653481Z ##[endgroup] 2024-06-26T07:06:00.4671246Z ##[group]Run set -eux 2024-06-26T07:06:00.4671577Z set -eux 2024-06-26T07:06:00.4672177Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2024-06-26T07:06:00.4679946Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:00.4680441Z env: 2024-06-26T07:06:00.4680716Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:00.4681171Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:00.4681822Z GITHUB_TOKEN: *** 2024-06-26T07:06:00.4682125Z ##[endgroup] 2024-06-26T07:06:00.4702066Z + python3 .github/scripts/get_workflow_job_id.py 9673645538 i-089977a14ceff415f 2024-06-26T07:06:03.6912022Z setting job-id=26688880777 2024-06-26T07:06:03.6913298Z setting job-name=linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:03.7117467Z Prepare all required actions 2024-06-26T07:06:03.7117920Z Getting action download info 2024-06-26T07:06:03.8122888Z ##[group]Run ./.github/actions/filter-test-configs 2024-06-26T07:06:03.8123353Z with: 2024-06-26T07:06:03.8123817Z github-token: *** 2024-06-26T07:06:03.8126067Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]} 2024-06-26T07:06:03.8128588Z job-name: linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:03.8129475Z env: 2024-06-26T07:06:03.8129750Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:03.8130233Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:03.8130708Z ##[endgroup] 2024-06-26T07:06:03.8173409Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2024-06-26T07:06:03.8173969Z with: 2024-06-26T07:06:03.8174241Z shell: bash 2024-06-26T07:06:03.8174539Z timeout_minutes: 10 2024-06-26T07:06:03.8174852Z max_attempts: 5 2024-06-26T07:06:03.8175165Z retry_wait_seconds: 30 2024-06-26T07:06:03.8176308Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.1 2024-06-26T07:06:03.8177496Z polling_interval_seconds: 1 2024-06-26T07:06:03.8177870Z warning_on_retry: true 2024-06-26T07:06:03.8178217Z continue_on_error: false 2024-06-26T07:06:03.8178545Z env: 2024-06-26T07:06:03.8178818Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:03.8179282Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:03.8179935Z GITHUB_TOKEN: *** 2024-06-26T07:06:03.8180239Z ##[endgroup] 2024-06-26T07:06:03.8668263Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.1 2024-06-26T07:06:04.0782945Z Defaulting to user installation because normal site-packages is not writeable 2024-06-26T07:06:04.0935017Z Requirement already satisfied: requests==2.27.1 in /home/ec2-user/.local/lib/python3.7/site-packages (2.27.1) 2024-06-26T07:06:04.1074615Z Requirement already satisfied: pyyaml==6.0.1 in /home/ec2-user/.local/lib/python3.7/site-packages (6.0.1) 2024-06-26T07:06:04.1083184Z Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= "3" in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.27.1) (2.0.12) 2024-06-26T07:06:04.1103262Z Requirement already satisfied: idna<4,>=2.5; python_version >= "3" in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.27.1) (3.7) 2024-06-26T07:06:04.1114619Z Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.27.1) (1.26.19) 2024-06-26T07:06:04.1307413Z Requirement already satisfied: certifi>=2017.4.17 in /home/ec2-user/.local/lib/python3.7/site-packages (from requests==2.27.1) (2024.6.2) 2024-06-26T07:06:04.8669348Z Command completed after 1 attempt(s). 2024-06-26T07:06:04.8711171Z ##[group]Run set -x 2024-06-26T07:06:04.8711510Z set -x 2024-06-26T07:06:04.8711807Z  2024-06-26T07:06:04.8712363Z # Use relative path here as this could be checked out anywhere, not necessarily 2024-06-26T07:06:04.8713052Z # in runner workspace 2024-06-26T07:06:04.8713584Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2024-06-26T07:06:04.8721449Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:04.8722044Z env: 2024-06-26T07:06:04.8722322Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:04.8722789Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:04.8723413Z ##[endgroup] 2024-06-26T07:06:04.8743849Z + python3 /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2024-06-26T07:06:04.8942499Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2024-06-26T07:06:04.8943055Z echo "Workflow: ${GITHUB_WORKFLOW}" 2024-06-26T07:06:04.8943528Z echo "Job name: ${JOB_NAME}" 2024-06-26T07:06:04.8943929Z  2024-06-26T07:06:04.8944489Z # Use relative path here as this could be checked out anywhere, not necessarily 2024-06-26T07:06:04.8945194Z # in runner workspace 2024-06-26T07:06:04.8945879Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2024-06-26T07:06:04.8946551Z  --workflow "${GITHUB_WORKFLOW}" \ 2024-06-26T07:06:04.8947023Z  --job-name "${JOB_NAME}" \ 2024-06-26T07:06:04.8949378Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" \ 2024-06-26T07:06:04.8951917Z  --selected-test-configs "" \ 2024-06-26T07:06:04.8952382Z  --pr-number "${PR_NUMBER}" \ 2024-06-26T07:06:04.8952820Z  --tag "${TAG}" \ 2024-06-26T07:06:04.8953230Z  --event-name "${EVENT_NAME}" \ 2024-06-26T07:06:04.8953683Z  --schedule "${SCHEDULE}" \ 2024-06-26T07:06:04.8954126Z  --branch "${HEAD_BRANCH}" 2024-06-26T07:06:04.8961883Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:04.8962383Z env: 2024-06-26T07:06:04.8962681Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:04.8963148Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:04.8963840Z GITHUB_TOKEN: *** 2024-06-26T07:06:04.8964519Z JOB_NAME: linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:04.8965427Z PR_NUMBER: 129470 2024-06-26T07:06:04.8965747Z TAG: 2024-06-26T07:06:04.8966037Z EVENT_NAME: pull_request 2024-06-26T07:06:04.8966390Z SCHEDULE: 2024-06-26T07:06:04.8966690Z HEAD_BRANCH: 2024-06-26T07:06:04.8966983Z ##[endgroup] 2024-06-26T07:06:04.8985850Z Workflow: pull 2024-06-26T07:06:04.8986761Z Job name: linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:05.0927535Z INFO:root:Found no test-config label on the PR, so all test configs are included 2024-06-26T07:06:05.2287788Z ##[group]Run echo "Filtered matrix:" 2024-06-26T07:06:05.2288235Z echo "Filtered matrix:" 2024-06-26T07:06:05.2290440Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}" 2024-06-26T07:06:05.2292616Z  2024-06-26T07:06:05.2292886Z echo 2024-06-26T07:06:05.2293261Z echo "Is the current job unstable? False" 2024-06-26T07:06:05.2293721Z  2024-06-26T07:06:05.2293984Z echo 2024-06-26T07:06:05.2294339Z echo "Is keep-going label set? False" 2024-06-26T07:06:05.2294774Z  2024-06-26T07:06:05.2295034Z echo 2024-06-26T07:06:05.2295354Z echo "Renabled issues? " 2024-06-26T07:06:05.2302995Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:05.2303492Z env: 2024-06-26T07:06:05.2303893Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:05.2304369Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:05.2304871Z ##[endgroup] 2024-06-26T07:06:05.2325130Z Filtered matrix: 2024-06-26T07:06:05.2327556Z {include: [{config: default, shard: 1, num_shards: 5, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 2, num_shards: 5, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 3, num_shards: 5, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 4, num_shards: 5, runner: linux.g5.4xlarge.nvidia.gpu}, {config: default, shard: 5, num_shards: 5, runner: linux.g5.4xlarge.nvidia.gpu}]} 2024-06-26T07:06:05.2329622Z 2024-06-26T07:06:05.2329778Z Is the current job unstable? False 2024-06-26T07:06:05.2330067Z 2024-06-26T07:06:05.2330339Z Is keep-going label set? False 2024-06-26T07:06:05.2330603Z 2024-06-26T07:06:05.2330727Z Renabled issues? 2024-06-26T07:06:05.2389444Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2024-06-26T07:06:05.2390152Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2024-06-26T07:06:05.2397801Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T07:06:05.2398307Z env: 2024-06-26T07:06:05.2398576Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:05.2399036Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:05.2399523Z JOB_TIMEOUT: 240 2024-06-26T07:06:05.2399828Z ##[endgroup] 2024-06-26T07:06:05.2469575Z ##[group]Run set -x 2024-06-26T07:06:05.2469967Z set -x 2024-06-26T07:06:05.2470271Z  2024-06-26T07:06:05.2470629Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2024-06-26T07:06:05.2471173Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2024-06-26T07:06:05.2471745Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2024-06-26T07:06:05.2472261Z  TEST_COMMAND=.ci/onnx/test.sh 2024-06-26T07:06:05.2472688Z else 2024-06-26T07:06:05.2473036Z  TEST_COMMAND=.ci/pytorch/test.sh 2024-06-26T07:06:05.2473466Z fi 2024-06-26T07:06:05.2473746Z  2024-06-26T07:06:05.2474225Z # detached container should get cleaned up by teardown_ec2_linux 2024-06-26T07:06:05.2474994Z # TODO: Stop building test binaries as part of the build phase 2024-06-26T07:06:05.2475677Z # Used for GPU_FLAG since that doesn't play nice 2024-06-26T07:06:05.2476270Z # shellcheck disable=SC2086,SC2090 2024-06-26T07:06:05.2476741Z container_name=$(docker run \ 2024-06-26T07:06:05.2477158Z  ${GPU_FLAG:-} \ 2024-06-26T07:06:05.2477535Z  -e BUILD_ENVIRONMENT \ 2024-06-26T07:06:05.2477936Z  -e PR_NUMBER \ 2024-06-26T07:06:05.2478294Z  -e GITHUB_ACTIONS \ 2024-06-26T07:06:05.2478688Z  -e GITHUB_REPOSITORY \ 2024-06-26T07:06:05.2479098Z  -e GITHUB_WORKFLOW \ 2024-06-26T07:06:05.2479477Z  -e GITHUB_JOB \ 2024-06-26T07:06:05.2479848Z  -e GITHUB_RUN_ID \ 2024-06-26T07:06:05.2480237Z  -e GITHUB_RUN_NUMBER \ 2024-06-26T07:06:05.2480641Z  -e GITHUB_RUN_ATTEMPT \ 2024-06-26T07:06:05.2481040Z  -e JOB_ID \ 2024-06-26T07:06:05.2481385Z  -e JOB_NAME \ 2024-06-26T07:06:05.2481731Z  -e BASE_SHA \ 2024-06-26T07:06:05.2482084Z  -e BRANCH \ 2024-06-26T07:06:05.2482421Z  -e SHA1 \ 2024-06-26T07:06:05.2482767Z  -e AWS_DEFAULT_REGION \ 2024-06-26T07:06:05.2483175Z  -e IN_WHEEL_TEST \ 2024-06-26T07:06:05.2483559Z  -e SHARD_NUMBER \ 2024-06-26T07:06:05.2483927Z  -e TEST_CONFIG \ 2024-06-26T07:06:05.2484302Z  -e NUM_TEST_SHARDS \ 2024-06-26T07:06:05.2484838Z  -e REENABLED_ISSUES \ 2024-06-26T07:06:05.2485251Z  -e CONTINUE_THROUGH_ERROR \ 2024-06-26T07:06:05.2485680Z  -e VERBOSE_TEST_LOGS \ 2024-06-26T07:06:05.2486094Z  -e NO_TEST_TIMEOUT \ 2024-06-26T07:06:05.2486466Z  -e NO_TD \ 2024-06-26T07:06:05.2486816Z  -e TD_DISTRIBUTED \ 2024-06-26T07:06:05.2487201Z  -e PR_LABELS \ 2024-06-26T07:06:05.2487603Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2024-06-26T07:06:05.2488067Z  -e SCCACHE_BUCKET \ 2024-06-26T07:06:05.2488467Z  -e SCCACHE_S3_KEY_PREFIX \ 2024-06-26T07:06:05.2488873Z  -e XLA_CUDA \ 2024-06-26T07:06:05.2489282Z  -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ 2024-06-26T07:06:05.2489800Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2024-06-26T07:06:05.2490320Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2024-06-26T07:06:05.2490836Z  -e SKIP_SCCACHE_INITIALIZATION=1 \ 2024-06-26T07:06:05.2491305Z  -e HUGGING_FACE_HUB_TOKEN \ 2024-06-26T07:06:05.2491726Z  -e DASHBOARD_TAG \ 2024-06-26T07:06:05.2492199Z  --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2024-06-26T07:06:05.2492890Z  --security-opt seccomp=unconfined \ 2024-06-26T07:06:05.2493367Z  --cap-add=SYS_PTRACE \ 2024-06-26T07:06:05.2493763Z  --ipc=host \ 2024-06-26T07:06:05.2494129Z  --shm-size="${SHM_SIZE}" \ 2024-06-26T07:06:05.2494536Z  --tty \ 2024-06-26T07:06:05.2494855Z  --detach \ 2024-06-26T07:06:05.2495221Z  --name="${container_name}" \ 2024-06-26T07:06:05.2495651Z  --user jenkins \ 2024-06-26T07:06:05.2496205Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2024-06-26T07:06:05.2496818Z  -w /var/lib/jenkins/workspace \ 2024-06-26T07:06:05.2497288Z  "${DOCKER_IMAGE}" 2024-06-26T07:06:05.2497651Z ) 2024-06-26T07:06:05.2498081Z # Propagate download.pytorch.org IP to container 2024-06-26T07:06:05.2499099Z grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts" 2024-06-26T07:06:05.2500170Z echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" 2024-06-26T07:06:05.2501176Z docker exec -t "${container_name}" sh -c "pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" 2024-06-26T07:06:05.2508449Z shell: /usr/bin/bash -e {0} 2024-06-26T07:06:05.2508798Z env: 2024-06-26T07:06:05.2509068Z GIT_DEFAULT_BRANCH: main 2024-06-26T07:06:05.2509522Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:05.2510137Z BUILD_ENVIRONMENT: linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T07:06:05.2510644Z PR_NUMBER: 129470 2024-06-26T07:06:05.2510989Z GITHUB_REPOSITORY: pytorch/pytorch 2024-06-26T07:06:05.2511407Z GITHUB_WORKFLOW: pull 2024-06-26T07:06:05.2511737Z GITHUB_JOB: test 2024-06-26T07:06:05.2512057Z GITHUB_RUN_ID: 9673645538 2024-06-26T07:06:05.2512416Z GITHUB_RUN_NUMBER: 219936 2024-06-26T07:06:05.2512769Z GITHUB_RUN_ATTEMPT: 1 2024-06-26T07:06:05.2513102Z JOB_ID: 26688880777 2024-06-26T07:06:05.2513786Z JOB_NAME: linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:05.2514539Z BRANCH: pull/129470 2024-06-26T07:06:05.2514917Z SHA1: b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T07:06:05.2515431Z BASE_SHA: 4b9c9a9cc9c9283380a011310ba180c105c3dcb9 2024-06-26T07:06:05.2515889Z TEST_CONFIG: default 2024-06-26T07:06:05.2516227Z SHARD_NUMBER: 5 2024-06-26T07:06:05.2516548Z NUM_TEST_SHARDS: 5 2024-06-26T07:06:05.2516869Z REENABLED_ISSUES: 2024-06-26T07:06:05.2517222Z CONTINUE_THROUGH_ERROR: False 2024-06-26T07:06:05.2517616Z VERBOSE_TEST_LOGS: False 2024-06-26T07:06:05.2517968Z NO_TEST_TIMEOUT: False 2024-06-26T07:06:05.2518312Z NO_TD: False 2024-06-26T07:06:05.2518615Z TD_DISTRIBUTED: False 2024-06-26T07:06:05.2519032Z SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 2024-06-26T07:06:05.2519529Z SCCACHE_S3_KEY_PREFIX: pull 2024-06-26T07:06:05.2519928Z SHM_SIZE: 2g 2024-06-26T07:06:05.2520897Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:06:05.2521948Z XLA_CUDA: 2024-06-26T07:06:05.2522434Z XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla 2024-06-26T07:06:05.2523051Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2024-06-26T07:06:05.2523482Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 0 2024-06-26T07:06:05.2523899Z DASHBOARD_TAG: 2024-06-26T07:06:05.2524218Z HUGGING_FACE_HUB_TOKEN: 2024-06-26T07:06:05.2524554Z ##[endgroup] 2024-06-26T07:06:05.2542934Z + [[ default == \m\u\l\t\i\g\p\u ]] 2024-06-26T07:06:05.2543626Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *onnx* ]] 2024-06-26T07:06:05.2544157Z + TEST_COMMAND=.ci/pytorch/test.sh 2024-06-26T07:06:05.2550379Z +++ nproc --ignore=2 2024-06-26T07:06:05.2563653Z ++ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e NO_TEST_TIMEOUT -e NO_TD -e TD_DISTRIBUTED -e PR_LABELS -e MAX_JOBS=14 -e SCCACHE_BUCKET -e SCCACHE_S3_KEY_PREFIX -e XLA_CUDA -e XLA_CLANG_CACHE_S3_BUCKET_NAME -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e SKIP_SCCACHE_INITIALIZATION=1 -e HUGGING_FACE_HUB_TOKEN -e DASHBOARD_TAG --env-file=/tmp/github_env_9673645538 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --shm-size=2g --tty --detach --name= --user jenkins -v /home/ec2-user/actions-runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T07:06:16.4705653Z + container_name=2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T07:06:16.4706948Z + grep download.pytorch.org /etc/hosts 2024-06-26T07:06:16.4708046Z + docker exec -i 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 sudo bash -c '/bin/cat >> /etc/hosts' 2024-06-26T07:06:16.5371867Z + echo DOCKER_CONTAINER_ID=2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T07:06:16.5373842Z ++ echo dist/torch-2.5.0a0+gitb8c4c54-cp310-cp310-linux_x86_64.whl 2024-06-26T07:06:16.5375768Z + docker exec -t 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 sh -c 'pip install dist/torch-2.5.0a0+gitb8c4c54-cp310-cp310-linux_x86_64.whl[opt-einsum] && .ci/pytorch/test.sh' 2024-06-26T07:06:16.9086417Z Processing ./dist/torch-2.5.0a0+gitb8c4c54-cp310-cp310-linux_x86_64.whl (from torch==2.5.0a0+gitb8c4c54) 2024-06-26T07:06:17.2261266Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (3.13.1) 2024-06-26T07:06:17.2265922Z Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (4.12.2) 2024-06-26T07:06:17.2268347Z Requirement already satisfied: sympy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (1.12.1) 2024-06-26T07:06:17.2271783Z Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (2.8.8) 2024-06-26T07:06:17.2274738Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (3.1.4) 2024-06-26T07:06:17.2277684Z Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (2024.2.0) 2024-06-26T07:06:17.2303536Z Requirement already satisfied: opt-einsum>=3.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (3.3.0) 2024-06-26T07:06:17.2360297Z Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from opt-einsum>=3.3->torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (1.21.2) 2024-06-26T07:06:17.2785365Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (2.1.5) 2024-06-26T07:06:17.2940362Z Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy->torch==2.5.0a0+gitb8c4c54->torch==2.5.0a0+gitb8c4c54) (1.3.0) 2024-06-26T07:06:18.0775087Z Installing collected packages: torch 2024-06-26T07:06:26.7801415Z Successfully installed torch-2.5.0a0+gitb8c4c54 2024-06-26T07:06:26.8469137Z ++ dirname .ci/pytorch/test.sh 2024-06-26T07:06:26.8472745Z + source .ci/pytorch/common.sh 2024-06-26T07:06:26.8473884Z +++ dirname .ci/pytorch/common.sh 2024-06-26T07:06:26.8478794Z ++ source .ci/pytorch/common_utils.sh 2024-06-26T07:06:26.8479858Z +++ declare -f -t trap_add 2024-06-26T07:06:26.8486657Z ++ set -ex 2024-06-26T07:06:26.8487333Z ++ [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *rocm* ]] 2024-06-26T07:06:26.8488011Z ++ BUILD_TEST_LIBTORCH=0 2024-06-26T07:06:26.8488663Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *rocm* ]] 2024-06-26T07:06:26.8490254Z ++ stat -c %u /var/lib/jenkins/workspace 2024-06-26T07:06:26.8500102Z + WORKSPACE_ORIGINAL_OWNER_ID=1000 2024-06-26T07:06:26.8500575Z + trap_add cleanup_workspace EXIT 2024-06-26T07:06:26.8500985Z + trap_add_cmd=cleanup_workspace 2024-06-26T07:06:26.8501356Z + shift 2024-06-26T07:06:26.8501635Z + for trap_add_name in "$@" 2024-06-26T07:06:26.8507522Z +++ trap -p EXIT 2024-06-26T07:06:26.8509553Z ++ eval 'extract_trap_cmd ' 2024-06-26T07:06:26.8510093Z +++ extract_trap_cmd 2024-06-26T07:06:26.8510606Z +++ printf '%s\n' '' 2024-06-26T07:06:26.8511175Z ++ printf '%s\n' cleanup_workspace 2024-06-26T07:06:26.8511765Z + trap -- ' 2024-06-26T07:06:26.8512111Z cleanup_workspace' EXIT 2024-06-26T07:06:26.8512571Z + sudo chown -R jenkins /var/lib/jenkins/workspace 2024-06-26T07:06:27.3041231Z + git config --global --add safe.directory /var/lib/jenkins/workspace 2024-06-26T07:06:27.3053803Z + echo 'Environment variables:' 2024-06-26T07:06:27.3054222Z Environment variables: 2024-06-26T07:06:27.3054548Z + env 2024-06-26T07:06:27.3059948Z INSTALLED_DB=yes 2024-06-26T07:06:27.3060430Z NV_LIBCUBLAS_VERSION=12.1.3.1-1 2024-06-26T07:06:27.3061093Z NVIDIA_VISIBLE_DEVICES=all 2024-06-26T07:06:27.3061615Z NV_NVML_DEV_VERSION=12.1.105-1 2024-06-26T07:06:27.3062363Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T07:06:27.3062952Z CONTINUE_THROUGH_ERROR=False 2024-06-26T07:06:27.3063446Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1 2024-06-26T07:06:27.3063955Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1 2024-06-26T07:06:27.3064534Z BUILD_ENVIRONMENT=linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T07:06:27.3065033Z HOSTNAME=2bc440a16935 2024-06-26T07:06:27.3066000Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3067021Z GITHUB_ACTION=__self 2024-06-26T07:06:27.3067508Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2024-06-26T07:06:27.3072442Z NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 2024-06-26T07:06:27.3076701Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.3.1-1 2024-06-26T07:06:27.3077216Z NV_NVTX_VERSION=12.1.105-1 2024-06-26T07:06:27.3077568Z GITHUB_RUN_NUMBER=219936 2024-06-26T07:06:27.3077914Z TEST_CONFIG=default 2024-06-26T07:06:27.3078255Z GITHUB_REPOSITORY_OWNER_ID=21003710 2024-06-26T07:06:27.3078725Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2024-06-26T07:06:27.3079188Z NV_CUDA_CUDART_DEV_VERSION=12.1.105-1 2024-06-26T07:06:27.3079682Z NV_LIBCUSPARSE_VERSION=12.1.0.106-1 2024-06-26T07:06:27.3080139Z NV_LIBNPP_VERSION=12.1.0.40-1 2024-06-26T07:06:27.3080578Z GITHUB_TRIGGERING_ACTOR=leslie-fang-intel 2024-06-26T07:06:27.3081080Z CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache 2024-06-26T07:06:27.3081556Z GITHUB_REF_TYPE=branch 2024-06-26T07:06:27.3081956Z TORCH_CUDA_ARCH_LIST=Maxwell 2024-06-26T07:06:27.3082563Z NCCL_VERSION=2.17.1-1 2024-06-26T07:06:27.3083000Z BASE_SHA=4b9c9a9cc9c9283380a011310ba180c105c3dcb9 2024-06-26T07:06:27.3083439Z XLA_CUDA= 2024-06-26T07:06:27.3083725Z HUGGING_FACE_HUB_TOKEN= 2024-06-26T07:06:27.3084517Z *** 2024-06-26T07:06:27.3085127Z CARGO_NET_GIT_FETCH_WITH_CLI=true 2024-06-26T07:06:27.3085605Z GITHUB_REPOSITORY_ID=65600975 2024-06-26T07:06:27.3085975Z GITHUB_ACTIONS=true 2024-06-26T07:06:27.3086303Z NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:27.3086995Z NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.105-1 2024-06-26T07:06:27.3087541Z NV_LIBNPP_PACKAGE=libnpp-12-1=12.1.0.40-1 2024-06-26T07:06:27.3088002Z SHA1=b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T07:06:27.3088509Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2024-06-26T07:06:27.3089000Z GITHUB_SHA=4b51b1a62a63a1add1b3a0f7882f5c7dc66b8f8d 2024-06-26T07:06:27.3089868Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/pull.yml@refs/pull/129470/merge 2024-06-26T07:06:27.3090499Z UCC_HOME=/usr 2024-06-26T07:06:27.3090907Z NV_LIBCUBLAS_DEV_VERSION=12.1.3.1-1 2024-06-26T07:06:27.3091354Z VERBOSE_TEST_LOGS=False 2024-06-26T07:06:27.3091689Z NVIDIA_PRODUCT_NAME=CUDA 2024-06-26T07:06:27.3092146Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1 2024-06-26T07:06:27.3092681Z GITHUB_REF=refs/pull/129470/merge 2024-06-26T07:06:27.3093090Z NV_CUDA_CUDART_VERSION=12.1.105-1 2024-06-26T07:06:27.3093524Z SHARD_NUMBER=5 2024-06-26T07:06:27.3093835Z GITHUB_REF_PROTECTED=false 2024-06-26T07:06:27.3094180Z HOME=/var/lib/jenkins 2024-06-26T07:06:27.3094558Z GITHUB_API_URL=https://api.github.com 2024-06-26T07:06:27.3095047Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2024-06-26T07:06:27.3095514Z UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb 2024-06-26T07:06:27.3096083Z SCCACHE_S3_KEY_PREFIX=pull 2024-06-26T07:06:27.3096438Z CUDA_VERSION=12.1.1 2024-06-26T07:06:27.3096993Z NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.3.1-1 2024-06-26T07:06:27.3097433Z NUM_TEST_SHARDS=5 2024-06-26T07:06:27.3097782Z UCX_HOME=/usr 2024-06-26T07:06:27.3098290Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.1-1 2024-06-26T07:06:27.3099362Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3100622Z JOB_NAME=linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:27.3101857Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3102985Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2024-06-26T07:06:27.3103657Z GITHUB_EVENT_NAME=pull_request 2024-06-26T07:06:27.3104020Z DASHBOARD_TAG= 2024-06-26T07:06:27.3104329Z GITHUB_RUN_ID=9673645538 2024-06-26T07:06:27.3104850Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.1.0.40-1 2024-06-26T07:06:27.3105531Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1 2024-06-26T07:06:27.3106635Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3107619Z GITHUB_ACTOR=leslie-fang-intel 2024-06-26T07:06:27.3108031Z NV_LIBNPP_DEV_VERSION=12.1.0.40-1 2024-06-26T07:06:27.3108407Z PR_NUMBER=129470 2024-06-26T07:06:27.3108714Z GITHUB_RUN_ATTEMPT=1 2024-06-26T07:06:27.3109051Z ANACONDA_PYTHON_VERSION=3.10 2024-06-26T07:06:27.3109533Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2024-06-26T07:06:27.3110040Z TERM=xterm 2024-06-26T07:06:27.3110453Z NV_LIBCUSPARSE_DEV_VERSION=12.1.0.106-1 2024-06-26T07:06:27.3110866Z INSTALLED_VISION=yes 2024-06-26T07:06:27.3111187Z BRANCH=pull/129470 2024-06-26T07:06:27.3111510Z OPENSSL_ROOT_DIR=/opt/openssl 2024-06-26T07:06:27.3111942Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2024-06-26T07:06:27.3136515Z CUDA_PATH=/usr/local/cuda 2024-06-26T07:06:27.3137298Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2024-06-26T07:06:27.3138378Z GITHUB_SERVER_URL=https://github.com 2024-06-26T07:06:27.3139017Z UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b 2024-06-26T07:06:27.3139688Z REENABLED_ISSUES= 2024-06-26T07:06:27.3139982Z SHLVL=1 2024-06-26T07:06:27.3140232Z MAX_JOBS=14 2024-06-26T07:06:27.3140550Z NV_CUDA_LIB_VERSION=12.1.1-1 2024-06-26T07:06:27.3140892Z NVARCH=x86_64 2024-06-26T07:06:27.3141354Z GITHUB_ACTOR_ID=53841472 2024-06-26T07:06:27.3141812Z GITHUB_WORKFLOW_SHA=4b51b1a62a63a1add1b3a0f7882f5c7dc66b8f8d 2024-06-26T07:06:27.3142322Z GITHUB_REF_NAME=129470/merge 2024-06-26T07:06:27.3142841Z NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1 2024-06-26T07:06:27.3143702Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2024-06-26T07:06:27.3144710Z GITHUB_JOB=test 2024-06-26T07:06:27.3145098Z NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1 2024-06-26T07:06:27.3145772Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2024-06-26T07:06:27.3146280Z NO_TEST_TIMEOUT=False 2024-06-26T07:06:27.3146596Z TD_DISTRIBUTED=False 2024-06-26T07:06:27.3146966Z NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.1-1 2024-06-26T07:06:27.3147761Z GITHUB_REPOSITORY=pytorch/pytorch 2024-06-26T07:06:27.3148170Z NV_NVPROF_VERSION=12.1.105-1 2024-06-26T07:06:27.3148801Z GITHUB_RETENTION_DAYS=90 2024-06-26T07:06:27.3149149Z OPENSSL_DIR=/opt/openssl 2024-06-26T07:06:27.3149776Z GITHUB_ACTION_REPOSITORY= 2024-06-26T07:06:27.3151288Z PATH=/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-06-26T07:06:27.3152507Z GITHUB_BASE_REF=gh/leslie-fang-intel/126/base 2024-06-26T07:06:27.3152964Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2024-06-26T07:06:27.3153631Z CI=true 2024-06-26T07:06:27.3153943Z NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1 2024-06-26T07:06:27.3154349Z GITHUB_REPOSITORY_OWNER=pytorch 2024-06-26T07:06:27.3155112Z JOB_ID=26688880777 2024-06-26T07:06:27.3155427Z INSTALLED_PROTOBUF=yes 2024-06-26T07:06:27.3155895Z GITHUB_HEAD_REF=gh/leslie-fang-intel/126/head 2024-06-26T07:06:27.3156333Z GITHUB_ACTION_REF= 2024-06-26T07:06:27.3156767Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2024-06-26T07:06:27.3157224Z GITHUB_WORKFLOW=pull 2024-06-26T07:06:27.3157562Z DEBIAN_FRONTEND=noninteractive 2024-06-26T07:06:27.3158454Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3159278Z NO_TD=False 2024-06-26T07:06:27.3159575Z SKIP_SCCACHE_INITIALIZATION=1 2024-06-26T07:06:27.3159932Z _=/usr/bin/env 2024-06-26T07:06:27.3160394Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2024-06-26T07:06:27.3203288Z + TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2024-06-26T07:06:27.3205068Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2024-06-26T07:06:27.3206157Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2024-06-26T07:06:27.3207221Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2024-06-26T07:06:27.3207847Z + BUILD_DIR=build 2024-06-26T07:06:27.3208427Z + BUILD_RENAMED_DIR=build_renamed 2024-06-26T07:06:27.3208810Z + BUILD_BIN_DIR=build/bin 2024-06-26T07:06:27.3209283Z + SHARD_NUMBER=5 2024-06-26T07:06:27.3209579Z + NUM_TEST_SHARDS=5 2024-06-26T07:06:27.3209896Z + export VALGRIND=ON 2024-06-26T07:06:27.3210217Z + VALGRIND=ON 2024-06-26T07:06:27.3210667Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *clang9* ]] 2024-06-26T07:06:27.3211159Z + [[ 0 == \1 ]] 2024-06-26T07:06:27.3211455Z + [[ False == \1 ]] 2024-06-26T07:06:27.3211919Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *bazel* ]] 2024-06-26T07:06:27.3212451Z ++ realpath build/custom_test_artifacts 2024-06-26T07:06:27.3214275Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/workspace/build/custom_test_artifacts 2024-06-26T07:06:27.3215262Z + [[ -n '' ]] 2024-06-26T07:06:27.3215600Z + echo 'Environment variables' 2024-06-26T07:06:27.3215977Z Environment variables 2024-06-26T07:06:27.3216292Z + env 2024-06-26T07:06:27.3218491Z INSTALLED_DB=yes 2024-06-26T07:06:27.3219205Z NV_LIBCUBLAS_VERSION=12.1.3.1-1 2024-06-26T07:06:27.3219701Z NVIDIA_VISIBLE_DEVICES=all 2024-06-26T07:06:27.3220239Z NV_NVML_DEV_VERSION=12.1.105-1 2024-06-26T07:06:27.3220932Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T07:06:27.3221557Z CONTINUE_THROUGH_ERROR=False 2024-06-26T07:06:27.3222049Z NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.1-1+cuda12.1 2024-06-26T07:06:27.3222554Z NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1 2024-06-26T07:06:27.3223235Z BUILD_ENVIRONMENT=linux-focal-cuda12.1-py3.10-gcc9-sm86 2024-06-26T07:06:27.3223737Z HOSTNAME=2bc440a16935 2024-06-26T07:06:27.3224583Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3225565Z GITHUB_ACTION=__self 2024-06-26T07:06:27.3225946Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2024-06-26T07:06:27.3229810Z NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 2024-06-26T07:06:27.3233695Z NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-1=12.1.3.1-1 2024-06-26T07:06:27.3234196Z NV_NVTX_VERSION=12.1.105-1 2024-06-26T07:06:27.3234551Z GITHUB_RUN_NUMBER=219936 2024-06-26T07:06:27.3234893Z TEST_CONFIG=default 2024-06-26T07:06:27.3235223Z GITHUB_REPOSITORY_OWNER_ID=21003710 2024-06-26T07:06:27.3235710Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2024-06-26T07:06:27.3236204Z NV_CUDA_CUDART_DEV_VERSION=12.1.105-1 2024-06-26T07:06:27.3236631Z NV_LIBCUSPARSE_VERSION=12.1.0.106-1 2024-06-26T07:06:27.3237051Z NV_LIBNPP_VERSION=12.1.0.40-1 2024-06-26T07:06:27.3237493Z GITHUB_TRIGGERING_ACTOR=leslie-fang-intel 2024-06-26T07:06:27.3237978Z CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache 2024-06-26T07:06:27.3238443Z GITHUB_REF_TYPE=branch 2024-06-26T07:06:27.3238783Z TORCH_CUDA_ARCH_LIST=Maxwell 2024-06-26T07:06:27.3239146Z NCCL_VERSION=2.17.1-1 2024-06-26T07:06:27.3239534Z BASE_SHA=4b9c9a9cc9c9283380a011310ba180c105c3dcb9 2024-06-26T07:06:27.3239974Z XLA_CUDA= 2024-06-26T07:06:27.3240249Z HUGGING_FACE_HUB_TOKEN= 2024-06-26T07:06:27.3240666Z *** 2024-06-26T07:06:27.3240951Z CARGO_NET_GIT_FETCH_WITH_CLI=true 2024-06-26T07:06:27.3241334Z GITHUB_REPOSITORY_ID=65600975 2024-06-26T07:06:27.3241694Z GITHUB_ACTIONS=true 2024-06-26T07:06:27.3242027Z NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T07:06:27.3242500Z NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=12.1.105-1 2024-06-26T07:06:27.3243032Z NV_LIBNPP_PACKAGE=libnpp-12-1=12.1.0.40-1 2024-06-26T07:06:27.3243496Z SHA1=b8c4c54d347aa776934c60784e35936878ef18dc 2024-06-26T07:06:27.3243983Z NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev 2024-06-26T07:06:27.3244472Z GITHUB_SHA=4b51b1a62a63a1add1b3a0f7882f5c7dc66b8f8d 2024-06-26T07:06:27.3245428Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/pull.yml@refs/pull/129470/merge 2024-06-26T07:06:27.3246053Z UCC_HOME=/usr 2024-06-26T07:06:27.3246396Z NV_LIBCUBLAS_DEV_VERSION=12.1.3.1-1 2024-06-26T07:06:27.3246801Z VERBOSE_TEST_LOGS=False 2024-06-26T07:06:27.3247133Z NVIDIA_PRODUCT_NAME=CUDA 2024-06-26T07:06:27.3247585Z NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-1 2024-06-26T07:06:27.3248056Z GITHUB_REF=refs/pull/129470/merge 2024-06-26T07:06:27.3248462Z NV_CUDA_CUDART_VERSION=12.1.105-1 2024-06-26T07:06:27.3248840Z SHARD_NUMBER=5 2024-06-26T07:06:27.3249148Z GITHUB_REF_PROTECTED=false 2024-06-26T07:06:27.3249486Z HOME=/var/lib/jenkins 2024-06-26T07:06:27.3249846Z GITHUB_API_URL=https://api.github.com 2024-06-26T07:06:27.3250408Z PYTORCH_TEST_RERUN_DISABLED_TESTS=0 2024-06-26T07:06:27.3250875Z UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb 2024-06-26T07:06:27.3251357Z SCCACHE_S3_KEY_PREFIX=pull 2024-06-26T07:06:27.3251707Z CUDA_VERSION=12.1.1 2024-06-26T07:06:27.3252120Z NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1.3.1-1 2024-06-26T07:06:27.3252557Z NUM_TEST_SHARDS=5 2024-06-26T07:06:27.3252857Z UCX_HOME=/usr 2024-06-26T07:06:27.3253433Z NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-1=12.1.1-1 2024-06-26T07:06:27.3254516Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3255770Z JOB_NAME=linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) 2024-06-26T07:06:27.3257002Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3258123Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2024-06-26T07:06:27.3258801Z GITHUB_EVENT_NAME=pull_request 2024-06-26T07:06:27.3259167Z DASHBOARD_TAG= 2024-06-26T07:06:27.3259460Z GITHUB_RUN_ID=9673645538 2024-06-26T07:06:27.3259902Z NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12.1.0.40-1 2024-06-26T07:06:27.3260431Z NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1 2024-06-26T07:06:27.3261411Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3262328Z GITHUB_ACTOR=leslie-fang-intel 2024-06-26T07:06:27.3262735Z NV_LIBNPP_DEV_VERSION=12.1.0.40-1 2024-06-26T07:06:27.3263104Z PR_NUMBER=129470 2024-06-26T07:06:27.3263419Z GITHUB_RUN_ATTEMPT=1 2024-06-26T07:06:27.3263731Z VALGRIND=ON 2024-06-26T07:06:27.3264019Z ANACONDA_PYTHON_VERSION=3.10 2024-06-26T07:06:27.3264459Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2024-06-26T07:06:27.3264921Z TERM=xterm 2024-06-26T07:06:27.3265336Z NV_LIBCUSPARSE_DEV_VERSION=12.1.0.106-1 2024-06-26T07:06:27.3265775Z INSTALLED_VISION=yes 2024-06-26T07:06:27.3266133Z BRANCH=pull/129470 2024-06-26T07:06:27.3266448Z OPENSSL_ROOT_DIR=/opt/openssl 2024-06-26T07:06:27.3266846Z LIBRARY_PATH=/usr/local/cuda/lib64/stubs 2024-06-26T07:06:27.3267270Z CUDA_PATH=/usr/local/cuda 2024-06-26T07:06:27.3268004Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2024-06-26T07:06:27.3268746Z GITHUB_SERVER_URL=https://github.com 2024-06-26T07:06:27.3269234Z UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b 2024-06-26T07:06:27.3269694Z REENABLED_ISSUES= 2024-06-26T07:06:27.3269989Z SHLVL=1 2024-06-26T07:06:27.3270253Z MAX_JOBS=14 2024-06-26T07:06:27.3270555Z NV_CUDA_LIB_VERSION=12.1.1-1 2024-06-26T07:06:27.3270904Z NVARCH=x86_64 2024-06-26T07:06:27.3271204Z GITHUB_ACTOR_ID=53841472 2024-06-26T07:06:27.3271650Z GITHUB_WORKFLOW_SHA=4b51b1a62a63a1add1b3a0f7882f5c7dc66b8f8d 2024-06-26T07:06:27.3272167Z GITHUB_REF_NAME=129470/merge 2024-06-26T07:06:27.3272594Z NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1 2024-06-26T07:06:27.3273240Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2024-06-26T07:06:27.3273838Z GITHUB_JOB=test 2024-06-26T07:06:27.3274224Z NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cuda12.1 2024-06-26T07:06:27.3274765Z LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64 2024-06-26T07:06:27.3275273Z NO_TEST_TIMEOUT=False 2024-06-26T07:06:27.3275603Z TD_DISTRIBUTED=False 2024-06-26T07:06:27.3275970Z NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.1-1 2024-06-26T07:06:27.3276409Z GITHUB_REPOSITORY=pytorch/pytorch 2024-06-26T07:06:27.3276820Z NV_NVPROF_VERSION=12.1.105-1 2024-06-26T07:06:27.3277176Z GITHUB_RETENTION_DAYS=90 2024-06-26T07:06:27.3277523Z OPENSSL_DIR=/opt/openssl 2024-06-26T07:06:27.3277871Z GITHUB_ACTION_REPOSITORY= 2024-06-26T07:06:27.3278894Z PATH=/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-06-26T07:06:27.3280155Z GITHUB_BASE_REF=gh/leslie-fang-intel/126/base 2024-06-26T07:06:27.3280617Z NV_LIBNCCL_PACKAGE_NAME=libnccl2 2024-06-26T07:06:27.3280977Z CI=true 2024-06-26T07:06:27.3281292Z NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1 2024-06-26T07:06:27.3281708Z GITHUB_REPOSITORY_OWNER=pytorch 2024-06-26T07:06:27.3282071Z JOB_ID=26688880777 2024-06-26T07:06:27.3282384Z INSTALLED_PROTOBUF=yes 2024-06-26T07:06:27.3282812Z GITHUB_HEAD_REF=gh/leslie-fang-intel/126/head 2024-06-26T07:06:27.3283244Z GITHUB_ACTION_REF= 2024-06-26T07:06:27.3283718Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2024-06-26T07:06:27.3284195Z GITHUB_WORKFLOW=pull 2024-06-26T07:06:27.3284525Z DEBIAN_FRONTEND=noninteractive 2024-06-26T07:06:27.3285696Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_b50abd87-62e7-4e55-9aeb-0f06e57e7242 2024-06-26T07:06:27.3286514Z NO_TD=False 2024-06-26T07:06:27.3286806Z SKIP_SCCACHE_INITIALIZATION=1 2024-06-26T07:06:27.3287168Z _=/usr/bin/env 2024-06-26T07:06:27.3287514Z + echo 'Testing pytorch' 2024-06-26T07:06:27.3287849Z Testing pytorch 2024-06-26T07:06:27.3288168Z + export LANG=C.UTF-8 2024-06-26T07:06:27.3288502Z + LANG=C.UTF-8 2024-06-26T07:06:27.3288791Z + PR_NUMBER=129470 2024-06-26T07:06:27.3289116Z + [[ default == \d\e\f\a\u\l\t ]] 2024-06-26T07:06:27.3289511Z + export CUDA_VISIBLE_DEVICES=0 2024-06-26T07:06:27.3289887Z + CUDA_VISIBLE_DEVICES=0 2024-06-26T07:06:27.3290240Z + export HIP_VISIBLE_DEVICES=0 2024-06-26T07:06:27.3290615Z + HIP_VISIBLE_DEVICES=0 2024-06-26T07:06:27.3290988Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2024-06-26T07:06:27.3291408Z + [[ default == \s\l\o\w ]] 2024-06-26T07:06:27.3291968Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *slow-gradcheck* ]] 2024-06-26T07:06:27.3292652Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *cuda* ]] 2024-06-26T07:06:27.3293203Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2024-06-26T07:06:27.3293677Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2024-06-26T07:06:27.3294095Z + [[ default == *crossref* ]] 2024-06-26T07:06:27.3294610Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *rocm* ]] 2024-06-26T07:06:27.3295235Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *xpu* ]] 2024-06-26T07:06:27.3295867Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *-bazel-* ]] 2024-06-26T07:06:27.3296426Z + pip_install --user ninja==1.10.2 2024-06-26T07:06:27.3296950Z + pip install --progress-bar off --user ninja==1.10.2 2024-06-26T07:06:27.6805749Z Collecting ninja==1.10.2 2024-06-26T07:06:27.6975897Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2024-06-26T07:06:27.7126475Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2024-06-26T07:06:28.5127791Z Installing collected packages: ninja 2024-06-26T07:06:28.5194807Z  WARNING: The script ninja is installed in '/var/lib/jenkins/.local/bin' which is not on PATH. 2024-06-26T07:06:28.5196108Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2024-06-26T07:06:28.5241608Z Successfully installed ninja-1.10.2 2024-06-26T07:06:28.5964436Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-06-26T07:06:28.5966922Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-06-26T07:06:28.5968480Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *aarch64* ]] 2024-06-26T07:06:28.5968990Z + install_tlparse 2024-06-26T07:06:28.5969377Z + pip_install --user tlparse==0.3.7 2024-06-26T07:06:28.5969906Z + pip install --progress-bar off --user tlparse==0.3.7 2024-06-26T07:06:28.9448245Z Collecting tlparse==0.3.7 2024-06-26T07:06:28.9832774Z Downloading tlparse-0.3.7-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (346 bytes) 2024-06-26T07:06:28.9958355Z Downloading tlparse-0.3.7-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.2 MB) 2024-06-26T07:06:29.7886215Z Installing collected packages: tlparse 2024-06-26T07:06:29.8247487Z Successfully installed tlparse-0.3.7 2024-06-26T07:06:29.8983220Z ++ python -m site --user-base 2024-06-26T07:06:29.9136646Z + PATH=/var/lib/jenkins/.local/bin:/var/lib/jenkins/.local/bin:/opt/cache/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2024-06-26T07:06:29.9138251Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *asan* ]] 2024-06-26T07:06:29.9138907Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *-debug* ]] 2024-06-26T07:06:29.9139552Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *-bazel-* ]] 2024-06-26T07:06:29.9140490Z + echo 'We are not in debug mode: linux-focal-cuda12.1-py3.10-gcc9-sm86. Expect the assertion to pass' 2024-06-26T07:06:29.9141637Z We are not in debug mode: linux-focal-cuda12.1-py3.10-gcc9-sm86. Expect the assertion to pass 2024-06-26T07:06:29.9142356Z + cd test 2024-06-26T07:06:29.9142879Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2024-06-26T07:06:31.4029569Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2024-06-26T07:06:31.4030112Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2024-06-26T07:06:31.4030853Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *-bazel-* ]] 2024-06-26T07:06:31.4031425Z + pushd test 2024-06-26T07:06:31.4031794Z ~/workspace/test ~/workspace 2024-06-26T07:06:31.4032361Z ++ python -c 'import torch; print(torch.version.cuda)' 2024-06-26T07:06:32.8764928Z + CUDA_VERSION=12.1 2024-06-26T07:06:32.8765600Z + '[' 12.1 == 12.4 ']' 2024-06-26T07:06:32.8765955Z + ISCUDA124= 2024-06-26T07:06:32.8766242Z + popd 2024-06-26T07:06:32.8766507Z ~/workspace 2024-06-26T07:06:32.8770098Z + DYNAMO_BENCHMARK_FLAGS=() 2024-06-26T07:06:32.8770670Z + [[ default == *dynamo_eager* ]] 2024-06-26T07:06:32.8771083Z + [[ default == *aot_eager* ]] 2024-06-26T07:06:32.8771562Z + [[ default == *aot_inductor* ]] 2024-06-26T07:06:32.8771969Z + [[ default == *inductor* ]] 2024-06-26T07:06:32.8772417Z + [[ default == *dynamic* ]] 2024-06-26T07:06:32.8772792Z + [[ default == *cpu_inductor* ]] 2024-06-26T07:06:32.8773345Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2024-06-26T07:06:32.8798806Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *libtorch* ]] 2024-06-26T07:06:32.8799634Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *-bazel-* ]] 2024-06-26T07:06:32.8800139Z + cd test 2024-06-26T07:06:32.8800609Z + python -c 'import torch; print(torch.__config__.show())' 2024-06-26T07:06:34.2363504Z PyTorch built with: 2024-06-26T07:06:34.2363903Z - GCC 9.4 2024-06-26T07:06:34.2364315Z - C++ Version: 201703 2024-06-26T07:06:34.2365381Z - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications 2024-06-26T07:06:34.2366470Z - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67) 2024-06-26T07:06:34.2367150Z - OpenMP 201511 (a.k.a. OpenMP 4.5) 2024-06-26T07:06:34.2367659Z - LAPACK is enabled (usually provided by MKL) 2024-06-26T07:06:34.2368131Z - NNPACK is enabled 2024-06-26T07:06:34.2368511Z - CPU capability usage: AVX2 2024-06-26T07:06:34.2368918Z - CUDA Runtime 12.1 2024-06-26T07:06:34.2369429Z - NVCC architecture flags: -gencode;arch=compute_86,code=sm_86 2024-06-26T07:06:34.2370028Z - CuDNN 90.1 (built against CUDA 12.4) 2024-06-26T07:06:34.2370471Z - Magma 2.6.1 2024-06-26T07:06:34.2377844Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.5.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 2024-06-26T07:06:34.2384782Z 2024-06-26T07:06:34.5136657Z + cd test 2024-06-26T07:06:34.5137869Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2024-06-26T07:06:35.7137360Z ATen/Parallel: 2024-06-26T07:06:35.7137820Z at::get_num_threads() : 8 2024-06-26T07:06:35.7138262Z at::get_num_interop_threads() : 16 2024-06-26T07:06:35.7138821Z OpenMP 201511 (a.k.a. OpenMP 4.5) 2024-06-26T07:06:35.7139311Z omp_get_max_threads() : 8 2024-06-26T07:06:35.7140353Z Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications 2024-06-26T07:06:35.7141172Z mkl_get_max_threads() : 8 2024-06-26T07:06:35.7141788Z Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67) 2024-06-26T07:06:35.7142434Z std::thread::hardware_concurrency() : 16 2024-06-26T07:06:35.7142859Z Environment variables: 2024-06-26T07:06:35.7143211Z OMP_NUM_THREADS : [not set] 2024-06-26T07:06:35.7143583Z MKL_NUM_THREADS : [not set] 2024-06-26T07:06:35.7143951Z ATen parallel backend: OpenMP 2024-06-26T07:06:35.7144209Z 2024-06-26T07:06:35.9560072Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *aarch64* ]] 2024-06-26T07:06:35.9560850Z + [[ default == *backward* ]] 2024-06-26T07:06:35.9561400Z + [[ default == *xla* ]] 2024-06-26T07:06:35.9561876Z + [[ default == *executorch* ]] 2024-06-26T07:06:35.9562368Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2024-06-26T07:06:35.9562961Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *libtorch* ]] 2024-06-26T07:06:35.9563583Z + [[ default == distributed ]] 2024-06-26T07:06:35.9563991Z + [[ default == *inductor_distributed* ]] 2024-06-26T07:06:35.9564496Z + [[ default == *inductor-halide* ]] 2024-06-26T07:06:35.9565382Z + [[ default == *inductor-micro-benchmark* ]] 2024-06-26T07:06:35.9565908Z + [[ default == *huggingface* ]] 2024-06-26T07:06:35.9566293Z + [[ default == *timm* ]] 2024-06-26T07:06:35.9566648Z + [[ default == *torchbench* ]] 2024-06-26T07:06:35.9567119Z + [[ default == *inductor_cpp_wrapper_abi_compatible* ]] 2024-06-26T07:06:35.9567609Z + [[ default == *inductor* ]] 2024-06-26T07:06:35.9567988Z + [[ default == *inductor* ]] 2024-06-26T07:06:35.9568358Z + [[ default == *dynamo* ]] 2024-06-26T07:06:35.9568712Z + [[ default == *dynamo* ]] 2024-06-26T07:06:35.9569222Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 == *rocm* ]] 2024-06-26T07:06:35.9569700Z + [[ 5 == 1 ]] 2024-06-26T07:06:35.9569980Z + [[ 5 == 2 ]] 2024-06-26T07:06:35.9570292Z + [[ 5 -gt 2 ]] 2024-06-26T07:06:35.9570591Z + install_torchvision 2024-06-26T07:06:35.9570912Z + local orig_preload 2024-06-26T07:06:35.9571231Z + local commit 2024-06-26T07:06:35.9571538Z ++ get_pinned_commit vision 2024-06-26T07:06:35.9571917Z ++ cat .github/ci_commit_pins/vision.txt 2024-06-26T07:06:35.9574577Z + commit=d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:35.9575052Z + orig_preload= 2024-06-26T07:06:35.9575409Z + '[' -n '' ']' 2024-06-26T07:06:35.9576241Z + pip_install --no-use-pep517 --user git+https://github.com/pytorch/vision.git@d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:35.9577677Z + pip install --progress-bar off --no-use-pep517 --user git+https://github.com/pytorch/vision.git@d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:36.2623908Z Collecting git+https://github.com/pytorch/vision.git@d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:36.2627917Z Cloning https://github.com/pytorch/vision.git (to revision d23a6e1664d20707c11781299611436e1f0c104f) to /tmp/pip-req-build-e1jpvfxa 2024-06-26T07:06:36.2642503Z Running command git clone --filter=blob:none --quiet https://github.com/pytorch/vision.git /tmp/pip-req-build-e1jpvfxa 2024-06-26T07:06:37.7557171Z Running command git rev-parse -q --verify 'sha^d23a6e1664d20707c11781299611436e1f0c104f' 2024-06-26T07:06:37.7576444Z Running command git fetch -q https://github.com/pytorch/vision.git d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:38.9604683Z Running command git checkout -q d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:39.1878325Z Resolved https://github.com/pytorch/vision.git to commit d23a6e1664d20707c11781299611436e1f0c104f 2024-06-26T07:06:41.4965788Z Preparing metadata (setup.py) ... [?25l- \ done 2024-06-26T07:06:41.5008528Z [?25hRequirement already satisfied: numpy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.19.0a0+d23a6e1) (1.21.2) 2024-06-26T07:06:41.5011616Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.19.0a0+d23a6e1) (2.5.0a0+gitb8c4c54) 2024-06-26T07:06:41.5017803Z Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torchvision==0.19.0a0+d23a6e1) (10.3.0) 2024-06-26T07:06:41.5222962Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (3.13.1) 2024-06-26T07:06:41.5231453Z Requirement already satisfied: typing-extensions>=4.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (4.12.2) 2024-06-26T07:06:41.5233806Z Requirement already satisfied: sympy in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (1.12.1) 2024-06-26T07:06:41.5237177Z Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (2.8.8) 2024-06-26T07:06:41.5239749Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (3.1.4) 2024-06-26T07:06:41.5242898Z Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->torchvision==0.19.0a0+d23a6e1) (2024.2.0) 2024-06-26T07:06:41.5670534Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->torchvision==0.19.0a0+d23a6e1) (2.1.5) 2024-06-26T07:06:41.5833489Z Requirement already satisfied: mpmath<1.4.0,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy->torch->torchvision==0.19.0a0+d23a6e1) (1.3.0) 2024-06-26T07:06:41.5907498Z Building wheels for collected packages: torchvision 2024-06-26T07:07:55.3091834Z Building wheel for torchvision (setup.py) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done 2024-06-26T07:07:55.3118206Z [?25h Created wheel for torchvision: filename=torchvision-0.19.0a0+d23a6e1-cp310-cp310-linux_x86_64.whl size=2112859 sha256=a84722e7364ae837b45d9efd016d641397c1a33f40c1077a1013db628c630fee 2024-06-26T07:07:55.3121312Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/0e/56/35/02931e71eb23fd2b85591c7ec05b733ca7c8b328a2fd151f96 2024-06-26T07:07:55.3154493Z Successfully built torchvision 2024-06-26T07:07:55.9977988Z Installing collected packages: torchvision 2024-06-26T07:07:56.3771080Z Successfully installed torchvision-0.19.0a0+d23a6e1 2024-06-26T07:07:56.4758477Z + '[' -n '' ']' 2024-06-26T07:07:56.4758847Z + test_python_shard 5 2024-06-26T07:07:56.4761878Z + [[ -z 5 ]] 2024-06-26T07:07:56.4762828Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --shard 5 5 --verbose 2024-06-26T07:07:56.5617911Z /var/lib/jenkins/workspace/test/run_test.py:21: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html 2024-06-26T07:07:56.5619037Z import pkg_resources 2024-06-26T07:07:59.8065399Z Downloading https://ossci-metrics.s3.amazonaws.com/slow-tests.json to /var/lib/jenkins/workspace/test/.pytorch-slow-tests.json 2024-06-26T07:07:59.8627640Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2024-06-26T07:07:59.9020203Z Ignoring disabled issues: [''] 2024-06-26T07:07:59.9119213Z Found test times from artifacts 2024-06-26T07:07:59.9483032Z Found test times from artifacts 2024-06-26T07:07:59.9494076Z Running 25% of tests based on TD 2024-06-26T07:07:59.9765069Z Running parallel tests on 3 processes 2024-06-26T07:07:59.9768166Z Name: tests to run (est. time: 54.31min) 2024-06-26T07:07:59.9768782Z Serial tests (0): 2024-06-26T07:07:59.9769192Z Parallel tests (31): 2024-06-26T07:07:59.9769605Z inductor/test_torchinductor_dynamic_shapes 1/5 2024-06-26T07:07:59.9770174Z inductor/test_torchinductor_codegen_dynamic_shapes 2/4 2024-06-26T07:07:59.9770739Z inductor/test_torchinductor 2/5 2024-06-26T07:07:59.9773459Z inductor/test_torchinductor 3/5 2024-06-26T07:07:59.9774054Z inductor/test_torchinductor 5/5 2024-06-26T07:07:59.9774483Z inductor/test_aot_inductor 4/11 2024-06-26T07:07:59.9774996Z inductor/test_aot_inductor 10/11 2024-06-26T07:07:59.9775524Z inductor/test_cuda_cpp_wrapper 2/4 2024-06-26T07:07:59.9776060Z inductor/test_cuda_cpp_wrapper 4/4 2024-06-26T07:07:59.9776714Z inductor/test_torchinductor_opinfo 4/16 2024-06-26T07:07:59.9777341Z inductor/test_torchinductor_opinfo 13/16 2024-06-26T07:07:59.9777996Z inductor/test_torchinductor_opinfo 15/16 2024-06-26T07:07:59.9778493Z inductor/test_inductor_utils 1/1 2024-06-26T07:07:59.9778928Z inductor/test_inplacing_pass 1/1 2024-06-26T07:07:59.9779328Z inductor/test_config 1/1 2024-06-26T07:07:59.9779889Z inductor/test_cpp_wrapper_hipify 1/1 2024-06-26T07:07:59.9780479Z inductor/test_extension_backend 1/1 2024-06-26T07:07:59.9780899Z inductor/test_smoke 1/1 2024-06-26T07:07:59.9781257Z inductor/test_indexing 1/1 2024-06-26T07:07:59.9781650Z inductor/test_group_batch_fusion 1/1 2024-06-26T07:07:59.9782072Z inductor/test_torchbind 1/1 2024-06-26T07:07:59.9782456Z inductor/test_utils 1/1 2024-06-26T07:07:59.9782821Z inductor/test_fx_fusion 1/1 2024-06-26T07:07:59.9783205Z inductor/test_metrics 1/1 2024-06-26T07:07:59.9783578Z inductor/test_triton_kernels 1/1 2024-06-26T07:07:59.9783985Z inductor/test_foreach 1/1 2024-06-26T07:07:59.9784354Z inductor/test_halide 1/1 2024-06-26T07:07:59.9784696Z test_ops 2/7 2024-06-26T07:07:59.9785006Z test_sparse_csr 2/2 2024-06-26T07:07:59.9785326Z test_foreach 1/1 2024-06-26T07:07:59.9785786Z test_transformers 1/3 2024-06-26T07:07:59.9786153Z Name: excluded (est. time: 27.05min) 2024-06-26T07:07:59.9786553Z Serial tests (0): 2024-06-26T07:07:59.9786877Z Parallel tests (45): 2024-06-26T07:07:59.9787215Z functorch/test_ops 3/6 2024-06-26T07:07:59.9787553Z test_decomp 4/13 2024-06-26T07:07:59.9787866Z test_decomp 5/13 2024-06-26T07:07:59.9788192Z test_decomp 6/13 2024-06-26T07:07:59.9788499Z test_meta 3/3 2024-06-26T07:07:59.9788807Z test_ops_fwd_gradients 1/2 2024-06-26T07:07:59.9789192Z test_ops_fwd_gradients 2/2 2024-06-26T07:07:59.9789564Z test_dataloader 1/1 2024-06-26T07:07:59.9789893Z test_binary_ufuncs 1/1 2024-06-26T07:07:59.9790230Z test_cuda 1/1 2024-06-26T07:07:59.9790578Z dynamo/test_inline_inbuilt_nn_modules 1/1 2024-06-26T07:07:59.9790997Z test_jit 1/1 2024-06-26T07:07:59.9791318Z functorch/test_control_flow 1/1 2024-06-26T07:07:59.9791713Z test_matmul_cuda 1/1 2024-06-26T07:07:59.9792052Z dynamo/test_export 1/1 2024-06-26T07:07:59.9792601Z dynamo/test_structured_trace 1/1 2024-06-26T07:07:59.9793046Z dynamo/test_activation_checkpointing 1/1 2024-06-26T07:07:59.9793470Z test_view_ops 1/1 2024-06-26T07:07:59.9793798Z test_sympy_utils 1/1 2024-06-26T07:07:59.9794140Z test_expanded_weights 1/1 2024-06-26T07:07:59.9794502Z functorch/test_ac 1/1 2024-06-26T07:07:59.9794849Z test_nestedtensor 1/1 2024-06-26T07:07:59.9795195Z dynamo/test_backends 1/1 2024-06-26T07:07:59.9795681Z torch_np/numpy_tests/core/test_einsum 1/1 2024-06-26T07:07:59.9796127Z dynamo/test_unspec 1/1 2024-06-26T07:07:59.9796475Z export/test_passes 1/1 2024-06-26T07:07:59.9796861Z dynamo/test_backward_higher_order_ops 1/1 2024-06-26T07:07:59.9797293Z nn/test_embedding 1/1 2024-06-26T07:07:59.9797661Z functorch/test_eager_transforms 1/1 2024-06-26T07:07:59.9798063Z test_ao_sparsity 1/1 2024-06-26T07:07:59.9798398Z test_custom_ops 1/1 2024-06-26T07:07:59.9798758Z dynamo/test_input_attr_tracking 1/1 2024-06-26T07:07:59.9799179Z test_type_promotion 1/1 2024-06-26T07:07:59.9799532Z test_maskedtensor 1/1 2024-06-26T07:07:59.9799877Z dynamo/test_decorators 1/1 2024-06-26T07:07:59.9800282Z dynamo/test_after_aot 1/1 2024-06-26T07:07:59.9800669Z dynamo/test_bytecode_utils 1/1 2024-06-26T07:07:59.9801069Z dynamo/test_autograd_function 1/1 2024-06-26T07:07:59.9801476Z test_segment_reductions 1/1 2024-06-26T07:07:59.9801873Z profiler/test_memory_profiler 1/1 2024-06-26T07:07:59.9802323Z torch_np/numpy_tests/core/test_numeric 1/1 2024-06-26T07:07:59.9802768Z profiler/test_torch_tidy 1/1 2024-06-26T07:07:59.9803146Z test_openmp 1/1 2024-06-26T07:07:59.9803478Z lazy/test_extract_compiled_graph 1/1 2024-06-26T07:07:59.9803889Z lazy/test_bindings 1/1 2024-06-26T07:07:59.9804508Z Starting test batch 'tests to run' 0.0 seconds after initiating testing 2024-06-26T07:07:59.9822761Z Running inductor/test_torchinductor_dynamic_shapes 1/5 ... [2024-06-26 07:07:59.981885] 2024-06-26T07:07:59.9826686Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_dynamic_shapes.py', '-m', 'serial', '--shard-id=1', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:07:59.982247] 2024-06-26T07:08:07.4614028Z 2024-06-26T07:08:07.4616114Z inductor/test_torchinductor_dynamic_shapes 1/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.5_5d16a40bb30a1f6d_.log 2024-06-26T07:08:07.4617402Z Running 0 items in this shard: 2024-06-26T07:08:07.4617667Z 2024-06-26T07:08:07.4618343Z Running inductor/test_torchinductor_codegen_dynamic_shapes 2/4 ... [2024-06-26 07:08:07.461543] 2024-06-26T07:08:07.4622478Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_codegen_dynamic_shapes.py', '-m', 'serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:07.461854] 2024-06-26T07:08:21.5541766Z 2024-06-26T07:08:21.5543572Z inductor/test_torchinductor_codegen_dynamic_shapes 2/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_2.4_04d45de13b6b3661_.log 2024-06-26T07:08:21.5546141Z Running 1 items in this shard: test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_large_block_sizes_dynamic_shapes_cuda 2024-06-26T07:08:21.5547154Z 2024-06-26T07:08:21.5547584Z Running inductor/test_torchinductor 2/5 ... [2024-06-26 07:08:21.554245] 2024-06-26T07:08:21.5552748Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'serial', '--shard-id=2', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:21.554650] 2024-06-26T07:08:27.2308173Z 2024-06-26T07:08:27.2310190Z inductor/test_torchinductor 2/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_2.5_02d523ab75dc5315_.log 2024-06-26T07:08:27.2311703Z Running 1 items in this shard: test/inductor/test_torchinductor.py::CpuTests::test_large_block_sizes_cpu 2024-06-26T07:08:27.2312342Z 2024-06-26T07:08:27.2312710Z Running inductor/test_torchinductor 3/5 ... [2024-06-26 07:08:27.230886] 2024-06-26T07:08:27.2327439Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'serial', '--shard-id=3', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:27.232273] 2024-06-26T07:08:32.9078929Z 2024-06-26T07:08:32.9080339Z inductor/test_torchinductor 3/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_3.5_68c4a191272db99e_.log 2024-06-26T07:08:32.9081550Z Running 0 items in this shard: 2024-06-26T07:08:32.9081900Z 2024-06-26T07:08:32.9083587Z Running inductor/test_torchinductor 5/5 ... [2024-06-26 07:08:32.908055] 2024-06-26T07:08:32.9087956Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'serial', '--shard-id=5', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:32.908393] 2024-06-26T07:08:47.6983944Z 2024-06-26T07:08:47.6985765Z inductor/test_torchinductor 5/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_5.5_10872025cba9e9d7_.log 2024-06-26T07:08:47.6987255Z Running 1 items in this shard: test/inductor/test_torchinductor.py::GPUTests::test_large_block_sizes_cuda 2024-06-26T07:08:47.6987904Z 2024-06-26T07:08:47.6988269Z Running inductor/test_aot_inductor 4/11 ... [2024-06-26 07:08:47.698386] 2024-06-26T07:08:47.6991464Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor.py', '-m', 'serial', '--shard-id=4', '--num-shards=11', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:47.698749] 2024-06-26T07:08:53.5244526Z 2024-06-26T07:08:53.5247012Z inductor/test_aot_inductor 4/11 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_4.11_d7fa306fe1a4f548_.log 2024-06-26T07:08:53.5248261Z Running 0 items in this shard: 2024-06-26T07:08:53.5248634Z 2024-06-26T07:08:53.5249088Z Running inductor/test_aot_inductor 10/11 ... [2024-06-26 07:08:53.524360] 2024-06-26T07:08:53.5251404Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor.py', '-m', 'serial', '--shard-id=10', '--num-shards=11', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:53.524773] 2024-06-26T07:08:59.3497744Z 2024-06-26T07:08:59.3499789Z inductor/test_aot_inductor 10/11 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_10.11_2043ffc964267947_.log 2024-06-26T07:08:59.3500910Z Running 0 items in this shard: 2024-06-26T07:08:59.3501174Z 2024-06-26T07:08:59.3501549Z Running inductor/test_cuda_cpp_wrapper 2/4 ... [2024-06-26 07:08:59.349840] 2024-06-26T07:08:59.3505717Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_cpp_wrapper.py', '-m', 'serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:08:59.350160] 2024-06-26T07:09:05.0748493Z 2024-06-26T07:09:05.0749880Z inductor/test_cuda_cpp_wrapper 2/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_cpp_wrapper_2.4_5c2ded4e6b8c41f0_.log 2024-06-26T07:09:05.0751583Z Running 0 items in this shard: 2024-06-26T07:09:05.0752257Z 2024-06-26T07:09:05.0752714Z Running inductor/test_cuda_cpp_wrapper 4/4 ... [2024-06-26 07:09:05.074929] 2024-06-26T07:09:05.0757248Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_cpp_wrapper.py', '-m', 'serial', '--shard-id=4', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:05.075302] 2024-06-26T07:09:10.8004158Z 2024-06-26T07:09:10.8008239Z inductor/test_cuda_cpp_wrapper 4/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_cpp_wrapper_4.4_f023de498d1dcd20_.log 2024-06-26T07:09:10.8009623Z Running 0 items in this shard: 2024-06-26T07:09:10.8009945Z 2024-06-26T07:09:10.8010443Z Running inductor/test_torchinductor_opinfo 4/16 ... [2024-06-26 07:09:10.800424] 2024-06-26T07:09:10.8012419Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'serial', '--shard-id=4', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:10.800809] 2024-06-26T07:09:17.8293941Z 2024-06-26T07:09:17.8296064Z inductor/test_torchinductor_opinfo 4/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_4.16_f134ef8362d55d36_.log 2024-06-26T07:09:17.8298028Z Running 0 items in this shard: 2024-06-26T07:09:17.8298410Z 2024-06-26T07:09:17.8298980Z Running inductor/test_torchinductor_opinfo 13/16 ... [2024-06-26 07:09:17.829506] 2024-06-26T07:09:17.8303506Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'serial', '--shard-id=13', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:17.829914] 2024-06-26T07:09:24.8071416Z 2024-06-26T07:09:24.8072997Z inductor/test_torchinductor_opinfo 13/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_13.16_9a70c8c8f24f3f61_.log 2024-06-26T07:09:24.8074302Z Running 0 items in this shard: 2024-06-26T07:09:24.8074565Z 2024-06-26T07:09:24.8077231Z Running inductor/test_torchinductor_opinfo 15/16 ... [2024-06-26 07:09:24.807419] 2024-06-26T07:09:24.8081488Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'serial', '--shard-id=15', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:24.807744] 2024-06-26T07:09:31.7855872Z 2024-06-26T07:09:31.7857809Z inductor/test_torchinductor_opinfo 15/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_15.16_1e51e04d7c702944_.log 2024-06-26T07:09:31.7859140Z Running 0 items in this shard: 2024-06-26T07:09:31.7859405Z 2024-06-26T07:09:31.7860191Z Running inductor/test_inductor_utils 1/1 ... [2024-06-26 07:09:31.785702] 2024-06-26T07:09:31.7864616Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inductor_utils.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:31.786046] 2024-06-26T07:09:34.2553788Z 2024-06-26T07:09:34.2556555Z inductor/test_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inductor_utils_1.1_ad70e84a7e428bf8_.log 2024-06-26T07:09:34.2557610Z 2024-06-26T07:09:34.2558052Z Running inductor/test_inplacing_pass 1/1 ... [2024-06-26 07:09:34.255207] 2024-06-26T07:09:34.2560146Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inplacing_pass.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:34.255542] 2024-06-26T07:09:36.8249000Z 2024-06-26T07:09:36.8250917Z inductor/test_inplacing_pass 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inplacing_pass_1.1_80b1e15494706a6a_.log 2024-06-26T07:09:36.8252987Z Running 0 items in this shard: 2024-06-26T07:09:36.8253397Z 2024-06-26T07:09:36.8351292Z Running inductor/test_config 1/1 ... [2024-06-26 07:09:36.825011] 2024-06-26T07:09:36.8353954Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_config.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:36.825386] 2024-06-26T07:09:41.4030181Z 2024-06-26T07:09:41.4032124Z inductor/test_config 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_config_1.1_049251ccc3c354eb_.log 2024-06-26T07:09:41.4034459Z Running 0 items in this shard: 2024-06-26T07:09:41.4034842Z 2024-06-26T07:09:41.4035402Z Running inductor/test_cpp_wrapper_hipify 1/1 ... [2024-06-26 07:09:41.402862] 2024-06-26T07:09:41.4038285Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpp_wrapper_hipify.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:41.403265] 2024-06-26T07:09:44.3237448Z 2024-06-26T07:09:44.3239240Z inductor/test_cpp_wrapper_hipify 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpp_wrapper_hipify_1.1_9d82448615fcae73_.log 2024-06-26T07:09:44.3240530Z Running 0 items in this shard: 2024-06-26T07:09:44.3240900Z 2024-06-26T07:09:44.3241337Z Running inductor/test_extension_backend 1/1 ... [2024-06-26 07:09:44.323833] 2024-06-26T07:09:44.3245809Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_extension_backend.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:44.324160] 2024-06-26T07:09:49.6480372Z 2024-06-26T07:09:49.6482105Z inductor/test_extension_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_extension_backend_1.1_57ce8c8d87880ebf_.log 2024-06-26T07:09:49.6483530Z Running 0 items in this shard: 2024-06-26T07:09:49.6483809Z 2024-06-26T07:09:49.6484159Z Running inductor/test_smoke 1/1 ... [2024-06-26 07:09:49.648134] 2024-06-26T07:09:49.6488965Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_smoke.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:49.648517] 2024-06-26T07:09:52.1173605Z 2024-06-26T07:09:52.1175083Z inductor/test_smoke 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_smoke_1.1_333d89a53a0a8ead_.log 2024-06-26T07:09:52.1176245Z 2024-06-26T07:09:52.1177239Z Running inductor/test_indexing 1/1 ... [2024-06-26 07:09:52.117394] 2024-06-26T07:09:52.1182047Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_indexing.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:52.117765] 2024-06-26T07:09:56.7903862Z 2024-06-26T07:09:56.7905779Z inductor/test_indexing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_indexing_1.1_c0b3d8d346f79d29_.log 2024-06-26T07:09:56.7907178Z Running 0 items in this shard: 2024-06-26T07:09:56.7907443Z 2024-06-26T07:09:56.7910608Z Running inductor/test_group_batch_fusion 1/1 ... [2024-06-26 07:09:56.790434] 2024-06-26T07:09:56.7912706Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_group_batch_fusion.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:09:56.790763] 2024-06-26T07:10:00.2115180Z 2024-06-26T07:10:00.2116925Z inductor/test_group_batch_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_group_batch_fusion_1.1_0e4afa83a0bf033f_.log 2024-06-26T07:10:00.2118448Z Running 0 items in this shard: 2024-06-26T07:10:00.2118791Z 2024-06-26T07:10:00.2119162Z Running inductor/test_torchbind 1/1 ... [2024-06-26 07:10:00.211521] 2024-06-26T07:10:00.2122776Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchbind.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:00.211870] 2024-06-26T07:10:02.7816456Z 2024-06-26T07:10:02.7817891Z inductor/test_torchbind 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchbind_1.1_ea6b0c226fb66272_.log 2024-06-26T07:10:02.7819067Z Running 0 items in this shard: 2024-06-26T07:10:02.7819409Z 2024-06-26T07:10:02.7820037Z Running inductor/test_utils 1/1 ... [2024-06-26 07:10:02.781695] 2024-06-26T07:10:02.7824933Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_utils.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:02.782105] 2024-06-26T07:10:05.3007672Z 2024-06-26T07:10:05.3009427Z inductor/test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_utils_1.1_f0f6b9795669181b_.log 2024-06-26T07:10:05.3011136Z Running 0 items in this shard: 2024-06-26T07:10:05.3011538Z 2024-06-26T07:10:05.3012026Z Running inductor/test_fx_fusion 1/1 ... [2024-06-26 07:10:05.300834] 2024-06-26T07:10:05.3016425Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_fx_fusion.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:05.301234] 2024-06-26T07:10:08.6716498Z 2024-06-26T07:10:08.6718190Z inductor/test_fx_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_fx_fusion_1.1_db50191f29b17c70_.log 2024-06-26T07:10:08.6719466Z Running 0 items in this shard: 2024-06-26T07:10:08.6719738Z 2024-06-26T07:10:08.6720122Z Running inductor/test_metrics 1/1 ... [2024-06-26 07:10:08.671676] 2024-06-26T07:10:08.6725805Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_metrics.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:08.672043] 2024-06-26T07:10:11.4920314Z 2024-06-26T07:10:11.4924401Z inductor/test_metrics 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_metrics_1.1_857a03fea532330d_.log 2024-06-26T07:10:11.4927488Z Running 0 items in this shard: 2024-06-26T07:10:11.4927905Z 2024-06-26T07:10:11.4928546Z Running inductor/test_triton_kernels 1/1 ... [2024-06-26 07:10:11.492007] 2024-06-26T07:10:11.4931206Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_triton_kernels.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:11.492554] 2024-06-26T07:10:14.1134329Z 2024-06-26T07:10:14.1136202Z inductor/test_triton_kernels 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_triton_kernels_1.1_3129f1f01b7932c4_.log 2024-06-26T07:10:14.1137810Z Running 0 items in this shard: 2024-06-26T07:10:14.1138191Z 2024-06-26T07:10:14.1138652Z Running inductor/test_foreach 1/1 ... [2024-06-26 07:10:14.112649] 2024-06-26T07:10:14.1141598Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_foreach.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:14.113058] 2024-06-26T07:10:19.6885715Z 2024-06-26T07:10:19.6887105Z inductor/test_foreach 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_foreach_1.1_e8ddf0a34eccf5e8_.log 2024-06-26T07:10:19.6888262Z Running 0 items in this shard: 2024-06-26T07:10:19.6888579Z 2024-06-26T07:10:19.6889245Z Running inductor/test_halide 1/1 ... [2024-06-26 07:10:19.688616] 2024-06-26T07:10:19.6893082Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_halide.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:19.688915] 2024-06-26T07:10:24.1080734Z 2024-06-26T07:10:24.1082350Z inductor/test_halide 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_halide_1.1_cd76d1abf5dd8067_.log 2024-06-26T07:10:24.1083545Z 2024-06-26T07:10:24.1090057Z Running test_ops 2/7 ... [2024-06-26 07:10:24.108764] 2024-06-26T07:10:24.1096746Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops.py', '-m', 'serial', '--shard-id=2', '--num-shards=7', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:24.109309] 2024-06-26T07:10:34.7412212Z 2024-06-26T07:10:34.7414283Z test_ops 2/7 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_2.7_89a369eb891d9f57_.log 2024-06-26T07:10:34.7415216Z Running 0 items in this shard: 2024-06-26T07:10:34.7415473Z 2024-06-26T07:10:34.7415772Z Running test_sparse_csr 2/2 ... [2024-06-26 07:10:34.741098] 2024-06-26T07:10:34.7417952Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '-m', 'serial', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:34.741440] 2024-06-26T07:10:39.6652253Z 2024-06-26T07:10:39.6653937Z test_sparse_csr 2/2 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_2.2_9210a3ca3f41947b_.log 2024-06-26T07:10:39.6654981Z Running 0 items in this shard: 2024-06-26T07:10:39.6655243Z 2024-06-26T07:10:39.6655538Z Running test_foreach 1/1 ... [2024-06-26 07:10:39.665126] 2024-06-26T07:10:39.6659154Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_foreach.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:39.665411] 2024-06-26T07:10:44.1381358Z 2024-06-26T07:10:44.1383374Z test_foreach 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_foreach_1.1_ed29ca00e5b5073e_.log 2024-06-26T07:10:44.1384467Z Running 0 items in this shard: 2024-06-26T07:10:44.1384741Z 2024-06-26T07:10:44.1385055Z Running test_transformers 1/3 ... [2024-06-26 07:10:44.138119] 2024-06-26T07:10:44.1388065Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_transformers.py', '-m', 'serial', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:10:44.138410] 2024-06-26T07:11:02.8866205Z 2024-06-26T07:11:02.8867718Z test_transformers 1/3 was successful, full logs can be found in artifacts with path test/test-reports/test_transformers_1.3_7fb68648f3e484ac_.log 2024-06-26T07:11:02.8868997Z Running 0 items in this shard: 2024-06-26T07:11:02.8869319Z 2024-06-26T07:11:02.8961723Z Running inductor/test_torchinductor_dynamic_shapes 1/5 ... [2024-06-26 07:11:02.895715] 2024-06-26T07:11:02.8962994Z Running inductor/test_torchinductor_codegen_dynamic_shapes 2/4 ... [2024-06-26 07:11:02.895862] 2024-06-26T07:11:02.8964968Z Running inductor/test_torchinductor 2/5 ... [2024-06-26 07:11:02.896026] 2024-06-26T07:11:02.8967641Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_dynamic_shapes.py', '-m', 'not serial', '--shard-id=1', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:11:02.896305] 2024-06-26T07:11:02.8970756Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_codegen_dynamic_shapes.py', '-m', 'not serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:11:02.896394] 2024-06-26T07:11:02.8973661Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'not serial', '--shard-id=2', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:11:02.896537] 2024-06-26T07:17:37.4106531Z 2024-06-26T07:17:37.4112423Z inductor/test_torchinductor 2/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_2.5_7765004cb3db638e_.log 2024-06-26T07:17:37.4267153Z Running 261 items in this shard: test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast1_broadcast3, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast3_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast3_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_broadcast1, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_dense, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_double_broadcast3, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_double_transposed, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_int_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_broadcast1, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_strided, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_transposed_broadcast3, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_transposed_int, test/inductor/test_torchinductor.py::CpuTests::test__unsafe_masked_index_cpu, test/inductor/test_torchinductor.py::CpuTests::test__unsafe_masked_index_put_accumulate_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adaptive_avg_pool1d_argmax_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adaptive_avg_pool2d2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adaptive_max_pool2d2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_add_complex3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_add_inplace_permuted_cpu, test/inductor/test_torchinductor.py::CpuTests::test_any_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange6_cpu, test/inductor/test_torchinductor.py::CpuTests::test_argmax_argmin3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_argmax_argmin_with_duplicates_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool3d_backward2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool3d_backward4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_baddbmm_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bitwise3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bmm2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bool_cpu, test/inductor/test_torchinductor.py::CpuTests::test_both_scalars_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bucketize_cpu, test/inductor/test_torchinductor.py::CpuTests::test_buffer_batch_norm_cpu, test/inductor/test_torchinductor.py::CpuTests::test_builtins_round_float_ndigits_pos_cpu, test/inductor/test_torchinductor.py::CpuTests::test_builtins_round_int_ndigits_zero_cpu, test/inductor/test_torchinductor.py::CpuTests::test_complex_memory_overlap_cpu, test/inductor/test_torchinductor.py::CpuTests::test_concat_add_inplace_cpu, test/inductor/test_torchinductor.py::CpuTests::test_config_option_dont_assume_alignment_recompiles_cpu, test/inductor/test_torchinductor.py::CpuTests::test_constant_pad_float64_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv2d_backward_channels_last_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv2d_channels_last_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv_inference_heuristics_cpu, test/inductor/test_torchinductor.py::CpuTests::test_convolution3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cudnn_rnn_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dist_bf16_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div9_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div_prim_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dropout_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dtype_mismatch_issue_cpu, test/inductor/test_torchinductor.py::CpuTests::test_empty2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_exp2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fallback_mutable_op_basic_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fallback_mutable_op_no_mutated_tensors_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fft_real_input_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fill1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu, test/inductor/test_torchinductor.py::CpuTests::test_float_index_expression_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fractional_max_pool2d3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fuse_tiled_cpu, test/inductor/test_torchinductor.py::CpuTests::test_horizonal_fusion1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inf_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inplace_where_pointwise_cpu, test/inductor/test_torchinductor.py::CpuTests::test_input_mutation4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_l1_loss_cpu, test/inductor/test_torchinductor.py::CpuTests::test_large_broadcast_reduction_cpu, test/inductor/test_torchinductor.py::CpuTests::test_large_offset_pointwise_cpu, test/inductor/test_torchinductor.py::CpuTests::test_large_strided_reduction_cpu, test/inductor/test_torchinductor.py::CpuTests::test_like_rands_cpu, test/inductor/test_torchinductor.py::CpuTests::test_logcumsumexp_zero_dim_cpu, test/inductor/test_torchinductor.py::CpuTests::test_long_tensor_cpu, test/inductor/test_torchinductor.py::CpuTests::test_mm_mixed_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multi_threading_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multilayer_any_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multilayer_prime_size_cpu, test/inductor/test_torchinductor.py::CpuTests::test_nan_to_num_cpu, test/inductor/test_torchinductor.py::CpuTests::test_narrow_cpu, test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu, test/inductor/test_torchinductor.py::CpuTests::test_new_ones_cpu, test/inductor/test_torchinductor.py::CpuTests::test_nll_loss_forward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_no_mega_fusion_during_lowering_cpu, test/inductor/test_torchinductor.py::CpuTests::test_one_hot_cpu, test/inductor/test_torchinductor.py::CpuTests::test_output_strides_cpu, test/inductor/test_torchinductor.py::CpuTests::test_permute1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_permute2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_airy_ai_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_chebyshev_polynomial_v_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_entr_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_gammaincc_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_log1p_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_modified_bessel_i0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_scaled_modified_bessel_k0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_shifted_chebyshev_polynomial_u_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_shifted_chebyshev_polynomial_w_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_xlogy_cpu, test/inductor/test_torchinductor.py::CpuTests::test_randint_cpu, test/inductor/test_torchinductor.py::CpuTests::test_randn_generator_cpu, test/inductor/test_torchinductor.py::CpuTests::test_randn_with_dtype_and_device_cpu, test/inductor/test_torchinductor.py::CpuTests::test_reflection_pad2d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_repeat_interleave_2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_resize_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scatter5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scatter_add2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sigmoid_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_softmax_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sqrt_dynamic_shapes_cpu, test/inductor/test_torchinductor.py::CpuTests::test_strided_inputs_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sum3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sum_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tmp_not_defined_issue2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_to_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_topk_cpu, test/inductor/test_torchinductor.py::CpuTests::test_unbacked_floordiv_simplify_errors_cpu, test/inductor/test_torchinductor.py::CpuTests::test_upsample_nearest2d_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_var_correction_cpu, test/inductor/test_torchinductor.py::CpuTests::test_var_mean_cpu, test/inductor/test_torchinductor.py::CpuTests::test_views1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_views5_cpu, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast1_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast1_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast1_strided, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast2_broadcast1, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast2_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_dense_broadcast2, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_dense_broadcast3, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_double_transposed, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_int_dense, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_broadcast1, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_transposed, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_broadcast1, test/inductor/test_torchinductor.py::GPUTests::test_AllenaiLongformerBase_repro_cuda, test/inductor/test_torchinductor.py::GPUTests::test__unsafe_masked_index_cuda, test/inductor/test_torchinductor.py::GPUTests::test_adaptive_avg_pool_with_output_size_0_cuda, test/inductor/test_torchinductor.py::GPUTests::test_add_complex3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_arange3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_arange6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool2d2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool2d7_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool2d_backward3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_batch_norm_2d_2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bernoulli2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_both_scalars_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bucketize_default_kwargs_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bucketize_int_cuda, test/inductor/test_torchinductor.py::GPUTests::test_buffer_batch_norm_cuda, test/inductor/test_torchinductor.py::GPUTests::test_buffer_copied_in_graph_cuda, test/inductor/test_torchinductor.py::GPUTests::test_builtins_round_cuda, test/inductor/test_torchinductor.py::GPUTests::test_builtins_round_float_ndigits_pos_cuda, test/inductor/test_torchinductor.py::GPUTests::test_builtins_round_int_ndigits_pos_cuda, test/inductor/test_torchinductor.py::GPUTests::test_builtins_round_int_ndigits_zero_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_inplace_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_single_empty_cuda, test/inductor/test_torchinductor.py::GPUTests::test_complex_fallback_cuda, test/inductor/test_torchinductor.py::GPUTests::test_const_int32_to_float_cuda, test/inductor/test_torchinductor.py::GPUTests::test_constant_pad_3d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_constant_pad_nd_inplace_cuda, test/inductor/test_torchinductor.py::GPUTests::test_conv_functional_bn_fuse_cuda, test/inductor/test_torchinductor.py::GPUTests::test_conv_with_as_strided_cuda, test/inductor/test_torchinductor.py::GPUTests::test_convolution4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cumsum_pattern_matcher_issue_cuda, test/inductor/test_torchinductor.py::GPUTests::test_custom_op_2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_custom_scan_op_cuda, test/inductor/test_torchinductor.py::GPUTests::test_data_type_propogation_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dense_mask_index_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div8_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div_by_zero_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div_precision_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div_prim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_deterministic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_trivial_1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dtype_mismatch_issue_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dtype_sympy_expr_cuda, test/inductor/test_torchinductor.py::GPUTests::test_elu_cuda, test/inductor/test_torchinductor.py::GPUTests::test_embedding_bag_byte_unpack_cuda, test/inductor/test_torchinductor.py::GPUTests::test_embedding_cuda, test/inductor/test_torchinductor.py::GPUTests::test_erfinv_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_no_mutated_tensors_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_with_return_cuda, test/inductor/test_torchinductor.py::GPUTests::test_float_index_expression_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fmod_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fmod_zero_dim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fractional_max_pool2d4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_generate_rand_fp8_cuda, test/inductor/test_torchinductor.py::GPUTests::test_grid_sampler_2d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_hardsigmoid_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_propagation_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_propagation_remainder_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inplace_activations_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inplace_mixed_dtype_ops_cuda, test/inductor/test_torchinductor.py::GPUTests::test_input_mutation2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_input_mutation3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_isinf_cuda, test/inductor/test_torchinductor.py::GPUTests::test_large_grid_cuda, test/inductor/test_torchinductor.py::GPUTests::test_large_offset_pointwise_cuda, test/inductor/test_torchinductor.py::GPUTests::test_large_pointwise_cuda, test/inductor/test_torchinductor.py::GPUTests::test_large_strided_reduction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_large_tensor_reduction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_like_rands2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linear2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linear_mixed_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linspace2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linspace3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_list_clearing_cuda, test/inductor/test_torchinductor.py::GPUTests::test_log2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_log_fp64_cuda, test/inductor/test_torchinductor.py::GPUTests::test_masked_fill_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d_with_indices_backward6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_mixed_mm_cuda, test/inductor/test_torchinductor.py::GPUTests::test_mm_views_cuda, test/inductor/test_torchinductor.py::GPUTests::test_nan_to_num_cuda, test/inductor/test_torchinductor.py::GPUTests::test_output_strides_cuda, test/inductor/test_torchinductor.py::GPUTests::test_permute2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pixel_shuffle_channels_last_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_gammaincc_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_hermite_polynomial_h_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_laguerre_polynomial_l_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_log1p_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_modified_bessel_k0_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_spherical_bessel_j0_cuda, test/inductor/test_torchinductor.py::GPUTests::test_profiler_mark_wrapper_call_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reflection_pad2d_backward_cuda, test/inductor/test_torchinductor.py::GPUTests::test_relu_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reuse_buffers_with_aliasing_cuda, test/inductor/test_torchinductor.py::GPUTests::test_roi_align_cuda, test/inductor/test_torchinductor.py::GPUTests::test_rsqrt_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scaled_dot_product_efficient_attention_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scatter6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scatter_add2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_should_pad_bench_for_bmm_cuda, test/inductor/test_torchinductor.py::GPUTests::test_silu_cuda, test/inductor/test_torchinductor.py::GPUTests::test_single_elem_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_scatter_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sqrt_dynamic_shapes_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sum1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sum4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sum_int_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tan_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tmp_not_defined_issue2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_transpose_add_cuda, test/inductor/test_torchinductor.py::GPUTests::test_uint4x2_mixed_mm_cuda, test/inductor/test_torchinductor.py::GPUTests::test_unfold_zero_dimension_tensor_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_bilinear2d_a_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_bilinear2d_b_cuda, test/inductor/test_torchinductor.py::GPUTests::test_var_correction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_view_as_complex_cuda, test/inductor/test_torchinductor.py::GPUTests::test_where_broadcast_cuda, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_inductor_sequence_nr, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_numpy_on_gpu, test/inductor/test_torchinductor.py::RNNTest::test_rnn_compile_safe, test/inductor/test_torchinductor.py::NanCheckerTest::test_nan_checker_fail, test/inductor/test_torchinductor.py::NanCheckerTest::test_nan_checker_pass 2024-06-26T07:17:37.4414037Z 2024-06-26T07:17:40.1962005Z Running inductor/test_torchinductor 3/5 ... [2024-06-26 07:17:40.195647] 2024-06-26T07:17:40.1964846Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'not serial', '--shard-id=3', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:17:40.196026] 2024-06-26T07:21:02.7188384Z 2024-06-26T07:21:02.7197439Z inductor/test_torchinductor_codegen_dynamic_shapes 2/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_2.4_8ea40204817c88d0_.log 2024-06-26T07:21:02.7397946Z Running 309 items in this shard: test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_AllenaiLongformerBase_repro_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test__unsafe_masked_index_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test__unsafe_masked_index_put_accumulate_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_adaptive_avg_pool2d1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_add_complex4_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_add_complex_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_add_inplace_permuted_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_adding_tensor_offsets_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_alexnet_prefix_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_angle_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_any_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_arange1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_argmax_argmin1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_as_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_avg_pool2d_backward4_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_bitwise2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_bitwise3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_bmm1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_buffer_batch_norm_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_buffer_use_after_remove_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_builtins_round_float_ndigits_neg_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cat_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cat_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cat_extern_kernel_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cat_uint8_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cat_unbacked_legacy_empty_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cauchy_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_clamp_type_promotion_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_concat_add_inplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_consecutive_split_cumsum_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_constant_pad_1d_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_constant_pad_float64_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_conv3d_channels_last_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_conv_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_conv_inference_heuristics_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_conv_with_as_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_convolution1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_convolution2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_convolution3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_cumprod_zero_dim_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_custom_op_1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_custom_scan_op_compiled_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_div2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_dropout2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_dropout3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_dropout_trivial_0_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_dtype_sympy_expr_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_elu_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_empty2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_erfinv_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_expanded_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_fallback_mutable_op_with_return_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_fill1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_floordiv_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_fmod_zero_dim_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_forced_buffer_realize_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_fractional_max_pool2d3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_fractional_max_pool2d4_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_getitem_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_glu_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_hardsigmoid_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_hardswish_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_propagation_floordiv_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_propagation_remainder_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_put3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_put_fallback1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_put_fallback2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_index_put_reinplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_inplace_activations_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_inplace_add_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_inplace_mixed_dtype_ops_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_inplace_resize_as_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_inplace_where_pointwise_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_input_mutation1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_int_input_dynamic_shapes_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_issue102546_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_large_broadcast_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_large_grid_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_large_offset_pointwise_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_layer_norm_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_leaky_relu_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_linear2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_linear_float64_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_list_clearing_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_log1p_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_log2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_log_fp64_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_logcumsumexp_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_max_pool2d1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_max_pool2d2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_max_pool2d_with_indices_backward5_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_min_max_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_min_max_reduction_nan_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_mixed_mm2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_mixed_mm3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_mm_mixed_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_multilayer_sum_low_prec_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_neg_max_uint8_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_new_cpp_build_logical_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_new_empty_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_nll_loss_forward_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_one_hot_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pad_cast_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_bessel_j0_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_erf_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_erfc_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_exp2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_expm1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_gammaincc_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_modified_bessel_i0_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_multigammaln_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_ndtr_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_polygamma_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_scaled_modified_bessel_k1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_shifted_chebyshev_polynomial_v_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pointwise_xlogy_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pow3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_pow_int_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_rand_like_deterministic_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_reduction1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_reduction2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_remainder_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_repeat_as_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_resize_as_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_round_correctness_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_rsqrt_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scalar_input_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scaled_dot_product_efficient_attention_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scatter2_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scatter6_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scatter_bf16_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_scheduler_vertical_fusion1_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_setitem_with_int_parameter_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sgn_extremal_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_should_pad_bench_for_bmm_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sigmoid_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sign_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sin_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_slice_scatter_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_softmax_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_softmax_one_kernel_persist_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_split_cumsum_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_split_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_squeeze_varargs_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sum3_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sum4_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_sum_int_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_to_device_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_to_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_triu_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_uint_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_unbacked_floordiv_simplify_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_unbacked_floordiv_simplify_errors_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_unbind_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_unsqueeze_inplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_upsample_nearest2d_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_var_correction_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_views5_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_views7_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_where_broadcast_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenCpuTests::test_xblock_divides_xnumel_dynamic_shapes_cpu, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_AllenaiLongformerBase_repro_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test__unsafe_masked_index_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_adaptive_avg_pool_with_output_size_0_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_adaptive_max_pool2d1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_add_complex_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_add_inplace_permuted_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_alexnet_prefix_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_any_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_arange2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_arange6_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_argmax_argmin3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_avg_pool2d2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_avg_pool3d_backward3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_bucketize_default_kwargs_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_bucketize_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_buffer_batch_norm_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_builtins_round_float_ndigits_pos_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_builtins_round_float_ndigits_zero_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_cat_negative_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_cat_of_loops_and_extern_kernel_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_cat_uint8_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_cauchy_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_computed_buffer_inlining_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_config_option_dont_assume_alignment_recompiles_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_constant_pad_1d_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_constant_pad_float64_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_constant_pad_nd_inplace_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_conv_bn_fuse_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_conv_inference_heuristics_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_conv_with_as_strided_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_cumprod_zero_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_custom_op_2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_custom_scan_op_compiled_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_custom_scan_op_multi_input_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_dist_bf16_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_div1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_div3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_dtype_mismatch_issue_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_eager_aoti_cache_hit_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_embedding_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_empty2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_empty_strided_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_exp2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_fallback_mutable_op_no_mutated_tensors_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_fallback_mutable_op_with_return_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_fft_real_input_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_float_index_expression_type_promotion_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_floordiv_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_fuse_large_params_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_getitem_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_hardsigmoid_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_horizonal_fusion2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_propagation_device_assert_masked_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_propagation_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_propagation_flip_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_propagation_floordiv_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_put1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_put_fallback1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_put_fallback2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_index_select_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_inplace_mixed_dtype_ops_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_int_input_dynamic_shapes_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_invalid_operand_issue1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_isinf2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_issue102546_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_kernel_names_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_large_pointwise_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_large_tensor_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_lerp_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_like_rands_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_linear1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_linear2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_linear_float64_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_list_clearing_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_log2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_max_min_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_max_pool2d4_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_max_pool2d_with_indices_backward6_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_mean_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_multi_gpu_device_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_multilayer_var_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_new_empty_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_nll_loss_forward_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pixel_shuffle_channels_last_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_bessel_j1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_chebyshev_polynomial_t_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_chebyshev_polynomial_w_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_expit_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_gammaincc_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_i0_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_log1p_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_modified_bessel_k0_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_psi_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_round_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pointwise_shifted_chebyshev_polynomial_v_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pow3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_pow_int_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_randn_generator_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_randn_like_empty_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_reduction1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_reflection_pad2d_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_relu_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_repeat_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_repeat_interleave_2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_require_stride_expanded_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_resize_as_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_roll_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_rsqrt_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_scalar_output_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_scatter5_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_scatter_reduce1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_scheduler_vertical_fusion1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sdpa_unaligned_mask_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sdpa_use_block_ptr_False_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_select_scatter_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sin_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sizehint_issue1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_slice3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_slice_mutation3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_slice_scatter2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_slice_scatter_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_slice_scatter_reinplace_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_softmax_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_softmax_one_kernel_persist_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_split_cumprod_low_prec_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_split_cumsum_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_split_cumsum_low_prec_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sqrt_dynamic_shapes_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_squeeze1_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sum3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_sum5_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_tanh_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_to_dtype_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_unfold_zero_dimension_tensor_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_unroll_small_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_unsqueeze_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_unsqueeze_inplace_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_upsample_bilinear2d_a_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_var_mean_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_view_as_complex_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_view_as_real_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_view_detach_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_views2_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_views3_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_views7_dynamic_shapes_cuda, test/inductor/test_torchinductor_codegen_dynamic_shapes.py::DynamicShapesCodegenGPUTests::test_zeros_dynamic_shapes_cuda 2024-06-26T07:21:02.7595090Z 2024-06-26T07:21:05.5952217Z Running inductor/test_torchinductor 5/5 ... [2024-06-26 07:21:05.594584] 2024-06-26T07:21:05.5954281Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor.py', '-m', 'not serial', '--shard-id=5', '--num-shards=5', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:21:05.594973] 2024-06-26T07:21:06.3280380Z 2024-06-26T07:21:06.3282264Z inductor/test_torchinductor_dynamic_shapes 1/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.5_5f33a46227fce35c_.log 2024-06-26T07:21:06.3423225Z Running 238 items in this shard: test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_adaptive_max_pool2d1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_complex3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_add_const_int_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_addmm_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_aliased_buffer_reuse_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_any_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_arange4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_argmin_with_nan_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_min_int32_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_argmax_to_float_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool2d4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool2d8_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool3d_backward4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_avg_pool3d_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_batch_norm_2d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bitwise3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_bitwise_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_buffer_copied_in_graph_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_buffer_use_after_remove_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_extern_kernel_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cat_negative_dim_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_clamp_type_promotion_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_compar_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_concat_add_inplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_3d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_constant_pad_float64_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_cumsum_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_custom_op_1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_custom_op_2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_custom_op_3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_custom_op_fixed_layout_sequential_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_custom_scan_op_compiled_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_eager_aoti_with_persistent_cache_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_eager_aoti_with_scalar_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_embedding_bag_byte_unpack_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_empty_strided_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fallback_mutable_op_no_mutated_tensors_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_flip_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_float_index_expression_type_promotion_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_floordiv_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fractional_max_pool2d3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_fractional_max_pool2d4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_full_like_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_gather3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_horizonal_fusion1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_horizonal_fusion2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_index_put_reinplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_indirect_load_broadcast_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_inner_fn_str_and_stride_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_insignificant_strides_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_isinf2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_isinf_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_kernel_names_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_l1_loss_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_large_strided_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_layer_norm_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_channels_last_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_like_rands_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linear2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_linspace2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log1p_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_log2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_logsumexp_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_max_pool2d_with_indices_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_mm_mixed_dtype_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_move_arange_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_new_ones_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_one_hot_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_output_strides_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_philox_rand_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pixel_shuffle_channels_last_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_bessel_j0_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_chebyshev_polynomial_v_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_erfinv_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_expit_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_gammaln_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_hermite_polynomial_h_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_i1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_laguerre_polynomial_l_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_modified_bessel_i0_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_round_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pointwise_xlogy_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_pow1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_reduction4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_reflection_pad2d_backward_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_reflection_pad2d_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_repeat_interleave_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_roll_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_round_correctness_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scaled_dot_product_attention_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter4_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_add2_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_scatter_reduce3_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sdpa_unaligned_mask_freezing_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_setitem_with_int_parameter_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sin_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sizehint_issue1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice_mutation1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice_scatter_reinplace_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_slice_view_with_graph_break_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_sort_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_stack_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_topk_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_unroll_small_reduction_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_upsample_cat_conv_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_vdd_clamp_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_as_real_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_view_detach_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views1_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views6_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesCpuTests::test_views7_dynamic_shapes_cpu, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test__unsafe_masked_index_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_addmm_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_arange2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_arange4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d7_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool2d_backward_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_avg_pool3d_backward2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_batch_norm_2d_2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_bernoulli1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_bfloat16_to_int16_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_bmm2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_buffer_use_after_remove_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_float_ndigits_pos_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_float_ndigits_zero_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_builtins_round_int_ndigits_zero_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cat_negative_dim_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cat_unbacked_2d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_clamp_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_clone_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_complex_fallback_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_complex_memory_overlap_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_computed_buffer_inlining_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_constant_pad_3d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv2d_channels_last_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv_backward_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_conv_bn_fuse_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_convolution2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_cudnn_rnn_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_div4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dropout3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_dtype_sympy_expr_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_eager_aoti_with_persistent_cache_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_empty2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_fractional_max_pool2d3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_full_truncation_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_gather2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_getitem_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_dynamic_shapes_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_put3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_index_select_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inductor_assert_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inf_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_inplace_resize_as_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_input_mutation4_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_invalid_operand_issue1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_broadcast_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_pointwise_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_large_strided_reduction_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_like_rands2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linear2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_linear_float64_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_log_fp64_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_logcumsumexp_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_long_tensor_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_masked_fill_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_min_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d6_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_max_pool2d_with_indices_backward5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_multilayer_prime_size_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_narrow_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pad_cast_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_philox_rand_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_bessel_j1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_gammaln_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_i0_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_modified_bessel_k1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_multigammaln_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_shifted_chebyshev_polynomial_w_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pointwise_zeta_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_pow3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_prod_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_rand_like_deterministic_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_randint_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_reduction3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_remove_noop_clone_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_repeat_interleave_2_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_require_stride_expanded_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_scatter_reduce1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_select_scatter_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sgn_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_signbit_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_silu_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_mutation1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_slice_scatter5_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_softmax_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_split_failed_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_split_with_integer_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_stack_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_sum1_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_tensor3_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_device_constant_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_to_dtype_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_topk_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_transpose_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_uint4x2_mixed_mm_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unbind_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unfold_zero_dimension_tensor_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unsqueeze_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_unsqueeze_inplace_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_bilinear2d_b_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_upsample_nearest3d_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_views6_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_where_with_logical_op_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::DynamicShapesGPUTests::test_xblock_divides_xnumel_dynamic_shapes_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_arange_dynamic_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_arithmetic_constant_folding_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_cat_unbacked_duplicate_size_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_full_symbolic_value_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_item_unbacked_stride_nobreak_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op0_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_math_ops_op3_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_nonzero_no_realloc_cuda, test/inductor/test_torchinductor_dynamic_shapes.py::TestInductorDynamicCUDA::test_unwrap_storage_didnt_work_repro_cuda 2024-06-26T07:21:06.3558532Z 2024-06-26T07:21:09.1373447Z Running inductor/test_aot_inductor 4/11 ... [2024-06-26 07:21:09.136713] 2024-06-26T07:21:09.1375725Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor.py', '-m', 'not serial', '--shard-id=4', '--num-shards=11', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:21:09.137048] 2024-06-26T07:26:38.1444523Z 2024-06-26T07:26:38.1450725Z inductor/test_torchinductor 3/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_3.5_f5d0a8dec275c195_.log 2024-06-26T07:26:38.1607069Z Running 267 items in this shard: test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast1_broadcast1, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast1_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast2_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast3_broadcast1, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast3_broadcast3, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast3_strided, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_double_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_double_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_int_broadcast1, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_transposed_broadcast2, test/inductor/test_torchinductor.py::CpuTests::test_abs_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adaptive_avg_pool2d1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_addmm_cpu, test/inductor/test_torchinductor.py::CpuTests::test_aliased_buffer_reuse_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_argmax_argmin2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d_backward4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_batch_norm_2d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bitwise_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bucketize_int_cpu, test/inductor/test_torchinductor.py::CpuTests::test_buffer_copied_in_graph_with_different_shapes_cpu, test/inductor/test_torchinductor.py::CpuTests::test_builtins_round_cpu, test/inductor/test_torchinductor.py::CpuTests::test_builtins_round_float_ndigits_zero_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_negative_dim_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_unbacked_legacy_empty_cpu, test/inductor/test_torchinductor.py::CpuTests::test_compar_cpu, test/inductor/test_torchinductor.py::CpuTests::test_complex_fallback_cpu, test/inductor/test_torchinductor.py::CpuTests::test_config_option_dont_assume_alignment_cpu, test/inductor/test_torchinductor.py::CpuTests::test_constant_pad_1d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_constant_pad_2d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_constant_pad_3d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv3d_channels_last_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv_bn_fuse_cpu, test/inductor/test_torchinductor.py::CpuTests::test_convolution1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_convolution2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cos_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cumprod_zero_dim_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cumsum_pattern_matcher_issue_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_op_1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_op_fixed_layout_channels_last_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_op_fixed_layout_sequential_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_scan_op_compiled_cpu, test/inductor/test_torchinductor.py::CpuTests::test_data_type_propogation_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dense_mask_index_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div_by_zero_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dropout_deterministic_cpu, test/inductor/test_torchinductor.py::CpuTests::test_eager_aoti_support_out_cpu, test/inductor/test_torchinductor.py::CpuTests::test_eager_aoti_with_persistent_cache_cpu, test/inductor/test_torchinductor.py::CpuTests::test_eager_aoti_with_scalar_cpu, test/inductor/test_torchinductor.py::CpuTests::test_elu_cpu, test/inductor/test_torchinductor.py::CpuTests::test_empty1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_expanded_reduction_cpu, test/inductor/test_torchinductor.py::CpuTests::test_expm1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_flip_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fuse_large_params_cpu, test/inductor/test_torchinductor.py::CpuTests::test_gather2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_getitem_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_abs_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_nested_indirect_indexing_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_remainder_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put_deterministic_fallback_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_tensor_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inplace_resize_as_cpu, test/inductor/test_torchinductor.py::CpuTests::test_issue102546_cpu, test/inductor/test_torchinductor.py::CpuTests::test_kernel_names_cpu, test/inductor/test_torchinductor.py::CpuTests::test_kwargs_cpu, test/inductor/test_torchinductor.py::CpuTests::test_lerp_cpu, test/inductor/test_torchinductor.py::CpuTests::test_linear1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_linear2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_linear_mixed_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_list_clearing_cpu, test/inductor/test_torchinductor.py::CpuTests::test_log2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d_with_indices_backward4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d_with_indices_backward6_cpu, test/inductor/test_torchinductor.py::CpuTests::test_misaligned_address_issue1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_mixed_mm3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multi_gpu_device_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multilayer_var_cpu, test/inductor/test_torchinductor.py::CpuTests::test_neg_index_cpu, test/inductor/test_torchinductor.py::CpuTests::test_nonzero_unbacked_refinement_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_bessel_j0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_bessel_y0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_expit_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_legendre_polynomial_p_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_shifted_chebyshev_polynomial_v_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_spherical_bessel_j0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_zeta_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pow2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_reduction3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_reflection_pad2d_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_remove_noop_clone_cpu, test/inductor/test_torchinductor.py::CpuTests::test_repeat_as_strided_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scatter_add1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scatter_reduce3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scheduler_vertical_fusion1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sdpa_use_block_ptr_True_cpu, test/inductor/test_torchinductor.py::CpuTests::test_select_scatter_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sign_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_silu_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice_mutation3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_split_cumprod_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tensor3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tensor_index_slice_cpu, test/inductor/test_torchinductor.py::CpuTests::test_to_device_cpu, test/inductor/test_torchinductor.py::CpuTests::test_transposed_propagates_cpu, test/inductor/test_torchinductor.py::CpuTests::test_uint_cpu, test/inductor/test_torchinductor.py::CpuTests::test_unbacked_floordiv_simplify_cpu, test/inductor/test_torchinductor.py::CpuTests::test_unfold_zero_dimension_tensor_cpu, test/inductor/test_torchinductor.py::CpuTests::test_upsample_cat_conv_cpu, test/inductor/test_torchinductor.py::CpuTests::test_upsample_nearest2d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_upsample_nearest3d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_vertical_fusion1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_where_broadcast_cpu, test/inductor/test_torchinductor.py::CpuTests::test_zero_dim_reductions_cpu, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast1_broadcast1, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast1_broadcast2, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast2_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast3_broadcast3, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast3_dense, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast3_transposed, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_dense_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_dense_strided, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_double_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_int_broadcast3, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_int_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_int_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_broadcast2, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_strided, test/inductor/test_torchinductor.py::GPUTests::test_add_complex4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_add_const_float_cuda, test/inductor/test_torchinductor.py::GPUTests::test_adding_tensor_offsets_cuda, test/inductor/test_torchinductor.py::GPUTests::test_angle_cuda, test/inductor/test_torchinductor.py::GPUTests::test_argmax_argmin3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_argmax_argmin_with_nan_cuda, test/inductor/test_torchinductor.py::GPUTests::test_argmax_to_float_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool2d_backward_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bfloat16_to_int16_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bmm1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bool_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bucketize_add_autotune_cuda, test/inductor/test_torchinductor.py::GPUTests::test_buffer_use_after_remove_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_empty_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_uint8_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_unbacked_2d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_clamp_type_promotion_cuda, test/inductor/test_torchinductor.py::GPUTests::test_clone_cuda, test/inductor/test_torchinductor.py::GPUTests::test_config_option_dont_assume_alignment_cudagraphs_cuda, test/inductor/test_torchinductor.py::GPUTests::test_constant_pad_2d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_convolution3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cos_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cumprod_zero_dim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cumsum_no_mask_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cumsum_zero_dim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_custom_op_1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_custom_scan_op_compiled_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div7_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div_zero_dim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_dropout_cuda, test/inductor/test_torchinductor.py::GPUTests::test_eager_aoti_support_out_cuda, test/inductor/test_torchinductor.py::GPUTests::test_eager_aoti_with_persistent_cache_cuda, test/inductor/test_torchinductor.py::GPUTests::test_empty2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_empty_strided_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_list_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fill2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_floordiv_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fmin_fmax_cuda, test/inductor/test_torchinductor.py::GPUTests::test_forced_buffer_realize_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fractional_max_pool2d3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_full_boolean_cuda, test/inductor/test_torchinductor.py::GPUTests::test_full_like_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fuse_tiled_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_dynamic_shapes_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_propagation_device_assert_masked_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_propagation_flip_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put_failed_reinplace_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put_fallback2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_tensor_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inf_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inner_fn_str_and_stride_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inplace_add_cuda, test/inductor/test_torchinductor.py::GPUTests::test_invalid_operand_issue1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_issue102546_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linear_float64_cuda, test/inductor/test_torchinductor.py::GPUTests::test_linspace1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_logcumsumexp_cuda, test/inductor/test_torchinductor.py::GPUTests::test_logsumexp_cuda, test/inductor/test_torchinductor.py::GPUTests::test_masked_fill_promotion_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d7_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d_with_indices_backward5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_min_max_reduction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multi_gpu_recompile_on_index_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multilayer_any_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multilayer_prime_size_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multilayer_sum_low_prec_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multilayer_var_cuda, test/inductor/test_torchinductor.py::GPUTests::test_neg_index_cuda, test/inductor/test_torchinductor.py::GPUTests::test_neg_max_uint8_cuda, test/inductor/test_torchinductor.py::GPUTests::test_new_ones_cuda, test/inductor/test_torchinductor.py::GPUTests::test_one_hot_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_entr_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_erfc_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_exp2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_legendre_polynomial_p_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_psi_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_scaled_modified_bessel_k0_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_shifted_chebyshev_polynomial_t_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_shifted_chebyshev_polynomial_w_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_xlog1py_cuda, test/inductor/test_torchinductor.py::GPUTests::test_polar_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pow1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_randint_kernel_count_cuda, test/inductor/test_torchinductor.py::GPUTests::test_randn_with_dtype_and_device_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reduction2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reduction3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reduction5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remainder_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_noop_clone_cuda, test/inductor/test_torchinductor.py::GPUTests::test_repeat_as_strided_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scatter5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scheduler_vertical_fusion1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sgn_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sizehint_issue1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_mutation2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_scatter2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_view_with_graph_break_cuda, test/inductor/test_torchinductor.py::GPUTests::test_softmax_one_kernel_loop_cuda, test/inductor/test_torchinductor.py::GPUTests::test_split_cumsum_cuda, test/inductor/test_torchinductor.py::GPUTests::test_split_with_unbacked_symints_cuda, test/inductor/test_torchinductor.py::GPUTests::test_stack_cuda, test/inductor/test_torchinductor.py::GPUTests::test_std_cuda, test/inductor/test_torchinductor.py::GPUTests::test_strided_inputs_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tanh_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tensor_index_put_slice_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tmp_not_defined_issue1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_to_device_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_nearest1d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_nearest3d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_vdd_clamp_cuda, test/inductor/test_torchinductor.py::GPUTests::test_view_on_aliased_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views7_cuda, test/inductor/test_torchinductor.py::GPUTests::test_xblock_divides_xnumel_cuda, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_divisible_by_16_covers_numel_args, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_evict_last_non_coalesced_loads_block_ptr, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_inductor_detach_view, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_optimize_indexing_dtype_with_constraint, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_split_op_with_sym 2024-06-26T07:26:38.1746720Z 2024-06-26T07:26:40.9645489Z Running inductor/test_aot_inductor 10/11 ... [2024-06-26 07:26:40.963879] 2024-06-26T07:26:40.9648549Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor.py', '-m', 'not serial', '--shard-id=10', '--num-shards=11', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:26:40.964280] 2024-06-26T07:29:09.1746964Z 2024-06-26T07:29:09.1748716Z inductor/test_torchinductor 5/5 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_5.5_c773631a35c5b5f5_.log 2024-06-26T07:29:09.1898253Z Running 279 items in this shard: test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast1_transposed, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast2_dense, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast2_double, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_double, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_dense_transposed, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_double_dense, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_int_dense, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_int_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_broadcast2, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_dense, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_int, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_strided_transposed, test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_transposed_strided, test/inductor/test_torchinductor.py::CpuTests::test_AllenaiLongformerBase_repro_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adaptive_avg_pool2d_low_prec_cpu, test/inductor/test_torchinductor.py::CpuTests::test_add_complex4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_add_complex_cpu, test/inductor/test_torchinductor.py::CpuTests::test_add_const_float_cpu, test/inductor/test_torchinductor.py::CpuTests::test_adding_tensor_offsets_cpu, test/inductor/test_torchinductor.py::CpuTests::test_angle_cpu, test/inductor/test_torchinductor.py::CpuTests::test_arange2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_argmax_min_int32_cpu, test/inductor/test_torchinductor.py::CpuTests::test_as_strided_cpu, test/inductor/test_torchinductor.py::CpuTests::test_as_strided_scatter_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d6_cpu, test/inductor/test_torchinductor.py::CpuTests::test_avg_pool2d7_cpu, test/inductor/test_torchinductor.py::CpuTests::test_batch_norm_2d_2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bernoulli2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu, test/inductor/test_torchinductor.py::CpuTests::test_bmm1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_buffer_use_after_remove_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_empty_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_uint8_cpu, test/inductor/test_torchinductor.py::CpuTests::test_cat_upcasting_cpu, test/inductor/test_torchinductor.py::CpuTests::test_clamp_cpu, test/inductor/test_torchinductor.py::CpuTests::test_clamp_type_promotion_cpu, test/inductor/test_torchinductor.py::CpuTests::test_clone_cpu, test/inductor/test_torchinductor.py::CpuTests::test_consecutive_split_cumsum_cpu, test/inductor/test_torchinductor.py::CpuTests::test_constant_pad_fill_dtype_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv_functional_bn_fuse_cpu, test/inductor/test_torchinductor.py::CpuTests::test_conv_with_as_strided_cpu, test/inductor/test_torchinductor.py::CpuTests::test_convolution4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_op_2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_custom_scan_op_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dist_cpu, test/inductor/test_torchinductor.py::CpuTests::test_div2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_dropout_trivial_0_cpu, test/inductor/test_torchinductor.py::CpuTests::test_eager_aoti_cache_hit_cpu, test/inductor/test_torchinductor.py::CpuTests::test_eager_aoti_override_registration_cpu, test/inductor/test_torchinductor.py::CpuTests::test_embedding_cpu, test/inductor/test_torchinductor.py::CpuTests::test_empty_strided_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fft_real_input_real_output_cpu, test/inductor/test_torchinductor.py::CpuTests::test_floordiv_cpu, test/inductor/test_torchinductor.py::CpuTests::test_fmod_zero_dim_cpu, test/inductor/test_torchinductor.py::CpuTests::test_full_boolean_cpu, test/inductor/test_torchinductor.py::CpuTests::test_gather1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_gather3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_grid_sampler_2d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_hardsigmoid_cpu, test/inductor/test_torchinductor.py::CpuTests::test_hardtanh_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_flip_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_propagation_floordiv_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put_as_masked_fill_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put_index_cpu, test/inductor/test_torchinductor.py::CpuTests::test_index_put_reinplace_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inductor_layout_optimization_input_mutations_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inner_fn_str_and_stride_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inplace_activations_cpu, test/inductor/test_torchinductor.py::CpuTests::test_inplace_add_cpu, test/inductor/test_torchinductor.py::CpuTests::test_input_mutation1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_input_mutation3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_input_mutation5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_insignificant_strides_cpu, test/inductor/test_torchinductor.py::CpuTests::test_int_input_dynamic_shapes_cpu, test/inductor/test_torchinductor.py::CpuTests::test_layer_norm_cpu, test/inductor/test_torchinductor.py::CpuTests::test_linear_float64_cpu, test/inductor/test_torchinductor.py::CpuTests::test_linspace3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_log_softmax_cpu, test/inductor/test_torchinductor.py::CpuTests::test_masked_scatter_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d7_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d8_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d_with_indices_backward5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_max_pool2d_with_indices_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_mean_cpu, test/inductor/test_torchinductor.py::CpuTests::test_mm_views_cpu, test/inductor/test_torchinductor.py::CpuTests::test_multilayer_var_lowp_cpu, test/inductor/test_torchinductor.py::CpuTests::test_nll_loss_backward_cpu, test/inductor/test_torchinductor.py::CpuTests::test_no_op_reduction_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pad_cast_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_bessel_j1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_bessel_y1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_chebyshev_polynomial_u_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_erfinv_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_exp2_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_expm1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_hermite_polynomial_he_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_i1e_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_laguerre_polynomial_l_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_logit_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_ndtr_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_ndtri_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_scaled_modified_bessel_k1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_shifted_chebyshev_polynomial_t_cpu, test/inductor/test_torchinductor.py::CpuTests::test_pointwise_xlog1py_cpu, test/inductor/test_torchinductor.py::CpuTests::test_profiler_mark_wrapper_call_cpu, test/inductor/test_torchinductor.py::CpuTests::test_reduction4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_repeat_cpu, test/inductor/test_torchinductor.py::CpuTests::test_repeat_interleave_cpu, test/inductor/test_torchinductor.py::CpuTests::test_require_stride_expanded_cpu, test/inductor/test_torchinductor.py::CpuTests::test_reuse_buffers_with_aliasing_cpu, test/inductor/test_torchinductor.py::CpuTests::test_roll_cpu, test/inductor/test_torchinductor.py::CpuTests::test_round_correctness_cpu, test/inductor/test_torchinductor.py::CpuTests::test_rsqrt_cpu, test/inductor/test_torchinductor.py::CpuTests::test_rsqrt_dynamic_shapes_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scalar_output_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scaled_dot_product_attention_cpu, test/inductor/test_torchinductor.py::CpuTests::test_scatter1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sdpa_use_block_ptr_False_cpu, test/inductor/test_torchinductor.py::CpuTests::test_simplify_loops_cpu, test/inductor/test_torchinductor.py::CpuTests::test_single_elem_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice_scatter5_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice_scatter_cpu, test/inductor/test_torchinductor.py::CpuTests::test_slice_view_with_graph_break_cpu, test/inductor/test_torchinductor.py::CpuTests::test_split_cpu, test/inductor/test_torchinductor.py::CpuTests::test_split_cumprod_low_prec_cpu, test/inductor/test_torchinductor.py::CpuTests::test_split_failed_cpu, test/inductor/test_torchinductor.py::CpuTests::test_squeeze1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_squeeze_varargs_cpu, test/inductor/test_torchinductor.py::CpuTests::test_stack_cpu, test/inductor/test_torchinductor.py::CpuTests::test_std_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sum4_cpu, test/inductor/test_torchinductor.py::CpuTests::test_sum_int_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tanh_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tmp_not_defined_issue1_cpu, test/inductor/test_torchinductor.py::CpuTests::test_tmp_not_defined_issue3_cpu, test/inductor/test_torchinductor.py::CpuTests::test_unbind_cpu, test/inductor/test_torchinductor.py::CpuTests::test_unsqueeze_cpu, test/inductor/test_torchinductor.py::CpuTests::test_upsample_nearest1d_cpu, test/inductor/test_torchinductor.py::CpuTests::test_view_on_aliased_cpu, test/inductor/test_torchinductor.py::CpuTests::test_views6_cpu, test/inductor/test_torchinductor.py::CpuTests::test_xblock_divides_xnumel_cpu, test/inductor/test_torchinductor.py::CpuTests::test_zeros_cpu, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast2_strided, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast3_broadcast1, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_broadcast3_strided, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_dense_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_double_broadcast2, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_double_strided, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_broadcast2, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_dense, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_strided_double, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_dense, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_int, test/inductor/test_torchinductor.py::SweepInputsGPUTest::test_cuda_transposed_transposed, test/inductor/test_torchinductor.py::GPUTests::test_adaptive_avg_pool2d2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_add_const_int_cuda, test/inductor/test_torchinductor.py::GPUTests::test_arange4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_arange5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_argmax_argmin1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_argmax_argmin2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_as_strided_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool2d5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool3d_backward3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool3d_backward4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_avg_pool3d_backward_cuda, test/inductor/test_torchinductor.py::GPUTests::test_bitwise3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_of_loops_and_extern_kernel_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_unbacked_empty_1d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cat_upcasting_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cauchy_cuda, test/inductor/test_torchinductor.py::GPUTests::test_computed_buffer_inlining_cuda, test/inductor/test_torchinductor.py::GPUTests::test_config_option_dont_assume_alignment_cuda, test/inductor/test_torchinductor.py::GPUTests::test_constant_pad_fill_dtype_cuda, test/inductor/test_torchinductor.py::GPUTests::test_conv2d_backward_channels_last_cuda, test/inductor/test_torchinductor.py::GPUTests::test_conv_bn_fuse_cuda, test/inductor/test_torchinductor.py::GPUTests::test_convolution2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_cumsum_cuda, test/inductor/test_torchinductor.py::GPUTests::test_custom_op_fixed_layout_channels_last_cuda, test/inductor/test_torchinductor.py::GPUTests::test_diagonal_copy_cuda, test/inductor/test_torchinductor.py::GPUTests::test_div6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_erfc_cuda, test/inductor/test_torchinductor.py::GPUTests::test_expand_as_cuda, test/inductor/test_torchinductor.py::GPUTests::test_expanded_reduction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_basic_cuda, test/inductor/test_torchinductor.py::GPUTests::test_float_index_expression_type_promotion_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fractional_max_pool2d1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_functionalize_rng_wrappers_cuda, test/inductor/test_torchinductor.py::GPUTests::test_fuse_large_params_cuda, test/inductor/test_torchinductor.py::GPUTests::test_gather2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_gather3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_getitem_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_propagation_abs_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put_as_masked_fill_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_put_index_cuda, test/inductor/test_torchinductor.py::GPUTests::test_index_select_cuda, test/inductor/test_torchinductor.py::GPUTests::test_inductor_layout_optimization_input_mutations_cuda, test/inductor/test_torchinductor.py::GPUTests::test_input_mutation5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_int_input_dynamic_shapes_cuda, test/inductor/test_torchinductor.py::GPUTests::test_isinf2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_l1_loss_cuda, test/inductor/test_torchinductor.py::GPUTests::test_lgamma_cuda, test/inductor/test_torchinductor.py::GPUTests::test_like_rands_cuda, test/inductor/test_torchinductor.py::GPUTests::test_log1p_cuda, test/inductor/test_torchinductor.py::GPUTests::test_log_softmax_cuda, test/inductor/test_torchinductor.py::GPUTests::test_logcumsumexp_zero_dim_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d8_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d_with_indices_backward2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_max_pool2d_with_indices_backward_cuda, test/inductor/test_torchinductor.py::GPUTests::test_mixed_mm2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_mixed_mm3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_move_arange_cuda, test/inductor/test_torchinductor.py::GPUTests::test_multi_gpu_device_cuda, test/inductor/test_torchinductor.py::GPUTests::test_narrow_cuda, test/inductor/test_torchinductor.py::GPUTests::test_no_op_reduction_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pad_view_cuda, test/inductor/test_torchinductor.py::GPUTests::test_philox_rand_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_chebyshev_polynomial_u_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_chebyshev_polynomial_v_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_hermite_polynomial_he_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_i0_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_i0e_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_logit_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_polygamma_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_round_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_shifted_chebyshev_polynomial_u_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_shifted_chebyshev_polynomial_v_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pointwise_sinc_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pow2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_pow3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_randint_cuda, test/inductor/test_torchinductor.py::GPUTests::test_randint_int64_mod_cuda, test/inductor/test_torchinductor.py::GPUTests::test_reduction1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_remove_no_ops_cuda, test/inductor/test_torchinductor.py::GPUTests::test_resize_as_cuda, test/inductor/test_torchinductor.py::GPUTests::test_roll_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scatter_add1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_scatter_reduce2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sdpa_unaligned_mask_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sdpa_use_block_ptr_False_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sgn_extremal_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sin_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_mutation1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_slice_scatter4_cuda, test/inductor/test_torchinductor.py::GPUTests::test_softmax_one_kernel_persist_cuda, test/inductor/test_torchinductor.py::GPUTests::test_sort_cuda, test/inductor/test_torchinductor.py::GPUTests::test_squeeze2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tensor2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tensor3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_tmp_not_defined_issue3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_transpose_cuda, test/inductor/test_torchinductor.py::GPUTests::test_transposed_propagates_cuda, test/inductor/test_torchinductor.py::GPUTests::test_triu_cuda, test/inductor/test_torchinductor.py::GPUTests::test_unspec_inputs_cuda, test/inductor/test_torchinductor.py::GPUTests::test_unsqueeze_cuda, test/inductor/test_torchinductor.py::GPUTests::test_unsqueeze_inplace_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_nearest2d_backward_cuda, test/inductor/test_torchinductor.py::GPUTests::test_upsample_nearest2d_cuda, test/inductor/test_torchinductor.py::GPUTests::test_vertical_fusion1_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views2_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views3_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views5_cuda, test/inductor/test_torchinductor.py::GPUTests::test_views6_cuda, test/inductor/test_torchinductor.py::GPUTests::test_where_with_logical_op_cuda, test/inductor/test_torchinductor.py::GPUTests::test_zero_dim_reductions_cuda, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_bandwidth_profiler, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_constant_folding_deallocation, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_evict_last_non_coalesced_loads, test/inductor/test_torchinductor.py::TritonCodeGenTests::test_kernel_names_descriptive, test/inductor/test_torchinductor.py::TestFull::test_full_dtype 2024-06-26T07:29:09.2039498Z 2024-06-26T07:29:12.0630097Z Running inductor/test_cuda_cpp_wrapper 2/4 ... [2024-06-26 07:29:12.062406] 2024-06-26T07:29:12.0632757Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_cpp_wrapper.py', '-m', 'not serial', '--shard-id=2', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:29:12.062744] 2024-06-26T07:30:16.0118203Z 2024-06-26T07:30:16.0120091Z inductor/test_aot_inductor 4/11 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_4.11_fd0ee16c73e05668_.log 2024-06-26T07:30:16.0209100Z Running 78 items in this shard: test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_cond_with_outer_code_before_after_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_convolution_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_dup_unbacked_sym_decl_with_refinement_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_duplicate_constant_folding_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_large_grid_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_linear_freezing_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_nested_tensor_from_jagged_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_output_misaligned_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_output_path_2_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_small_constant_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_triton_kernel_grid_type_2_num_dims_2_dynamic_True_autotune_True_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_while_loop_nested_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_buffer_mutation_1_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_buffer_reuse_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_cond_non_tensor_predicates_dynamic_False_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_cond_non_tensor_predicates_dynamic_True_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_freezing_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_large_grid_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_large_mmaped_weights_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_linear_freezing_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_nested_tensor_from_jagged_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_non_default_cuda_device_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_output_path_1_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_pytree_inputs_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_replicate_on_devices_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_return_view_constant_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_sdpa_2_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_triton_kernel_grid_type_2_num_dims_1_dynamic_True_autotune_True_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_triton_kernel_grid_type_2_num_dims_2_dynamic_False_autotune_False_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_triton_kernel_grid_type_2_num_dims_2_dynamic_True_autotune_True_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_triton_kernel_unbacked_symint_in_grid_dynamic_True_autotuning_True_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_add_complex_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_aliased_buffer_reuse_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_cond_non_tensor_predicates_dynamic_False_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_index_put_with_none_index_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_large_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_linear_freezing_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_nested_tensor_from_jagged_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_path_2_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_runtime_checks_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_grid_type_3_num_dims_2_dynamic_True_autotune_False_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_unbacked_symint_in_grid_dynamic_False_autotuning_True_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_view_outputs_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_zero_size_weight_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_add_complex_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_duplicate_constant_folding_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_output_path_2_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_runtime_checks_complex_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_runtime_checks_dtype_failed_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_runtime_checks_fp8_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_grid_type_1_num_dims_1_dynamic_True_autotune_False_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_grid_type_3_num_dims_1_dynamic_False_autotune_False_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_while_loop_with_outer_code_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_addmm_multiple_dynamic_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_assert_async_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_bmm_multiple_dynamic_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_buffer_mutation_2_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_cond_nested_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_cond_non_tensor_predicates_dynamic_True_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_conv_freezing_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_multi_device_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_no_args_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_runtime_checks_shape_failed_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_scatter_fallback_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_scatter_reduce_fallback_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_triton_kernel_grid_type_2_num_dims_2_dynamic_False_autotune_False_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_triton_kernel_grid_type_2_num_dims_2_dynamic_False_autotune_True_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_triton_kernel_grid_type_3_num_dims_1_dynamic_False_autotune_False_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_while_loop_nested_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_convolution_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_nested_tensor_from_jagged_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_output_path_2_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_repeat_output_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_scatter_reduce_fallback_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_triton_kernel_grid_type_2_num_dims_1_dynamic_False_autotune_False_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_triton_kernel_grid_type_3_num_dims_1_dynamic_False_autotune_False_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_with_offset_non_abi_compatible_cuda 2024-06-26T07:30:16.0297891Z 2024-06-26T07:30:18.8546061Z Running inductor/test_cuda_cpp_wrapper 4/4 ... [2024-06-26 07:30:18.854031] 2024-06-26T07:30:18.8548701Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cuda_cpp_wrapper.py', '-m', 'not serial', '--shard-id=4', '--num-shards=4', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:30:18.854421] 2024-06-26T07:36:12.3443065Z 2024-06-26T07:36:12.3445413Z inductor/test_aot_inductor 10/11 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_10.11_ff5c2860c2121ea8_.log 2024-06-26T07:36:12.3505915Z Running 81 items in this shard: test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_add_complex_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_bmm_multiple_dynamic_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_constant_folding_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_fqn_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_index_put_with_none_index_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_multiple_output_alias_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_non_tensor_input_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_replicate_on_devices_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_runtime_checks_shape_failed_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_sdpa_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_simple_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_triton_kernel_equal_to_1_float_arg_dynamic_False_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_triton_kernel_equal_to_1_float_arg_dynamic_True_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_triton_kernel_grid_type_1_num_dims_1_dynamic_False_autotune_False_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_triton_kernel_reinterpret_view_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_with_no_triton_profiler_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_with_offset_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_zero_grid_with_unbacked_symbols_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_add_complex_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_buffer_mutation_3_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_cond_simple_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_cond_use_buffers_from_outer_scope_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_deconv_freezing_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_duplicate_constant_folding_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_dynamic_smem_above_default_limit_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fake_tensor_device_validation_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fft_c2c_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_fqn_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_multi_device_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_scaled_dot_product_efficient_attention_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_small_constant_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocation::test_triton_kernel_unbacked_symint_in_grid_dynamic_True_autotuning_False_abi_compatible_cpu_with_stack_allocation, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_inf_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_output_misaligned_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_reuse_kernel_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_runtime_checks_complex_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_runtime_checks_shape_failed_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_grid_type_1_num_dims_1_dynamic_True_autotune_True_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_grid_type_2_num_dims_1_dynamic_False_autotune_True_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_grid_type_2_num_dims_2_dynamic_True_autotune_True_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_grid_type_3_num_dims_1_dynamic_True_autotune_False_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_triton_kernel_unbacked_symint_in_grid_dynamic_True_autotuning_False_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpuWithStackAllocationAndMinimalArrayRefInterface::test_while_loop_nested_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_buffer_mutation_1_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_cond_use_buffers_from_outer_scope_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_cond_with_reinterpret_view_inputs_outputs_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_fft_c2c_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_large_grid_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_multi_device_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_nan_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_scatter_reduce_fallback_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_equal_to_1_float_arg_dynamic_True_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_extern_kernel_arg_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_with_none_input_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_view_outputs_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_buffer_mutation_4_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_cond_simple_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_cond_with_parameters_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_convolution_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_deconv_freezing_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_linear_freezing_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_nan_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_poi_multiple_dynamic_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_runtime_checks_dtype_failed_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_triton_kernel_grid_type_1_num_dims_2_dynamic_True_autotune_True_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_triton_kernel_unbacked_symint_in_grid_dynamic_True_autotuning_False_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_while_loop_simple_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_zero_grid_with_unbacked_symbols_non_abi_compatible_cpu, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_addmm_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_cond_use_buffers_from_outer_scope_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_dynamic_scalar_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_linear_freezing_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_poi_multiple_dynamic_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_repeat_interleave_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_runtime_checks_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_runtime_checks_shape_failed_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_simple_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_triton_kernel_grid_type_2_num_dims_1_dynamic_True_autotune_True_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_triton_kernel_reinterpret_view_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_while_loop_simple_non_abi_compatible_cuda, test/inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCuda::test_while_loop_with_parameters_non_abi_compatible_cuda 2024-06-26T07:36:12.3571028Z 2024-06-26T07:36:15.1879233Z Running inductor/test_torchinductor_opinfo 4/16 ... [2024-06-26 07:36:15.187351] 2024-06-26T07:36:15.1881546Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'not serial', '--shard-id=4', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:36:15.187679] 2024-06-26T07:37:24.4441155Z 2024-06-26T07:37:24.4446708Z inductor/test_cuda_cpp_wrapper 4/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_cpp_wrapper_4.4_4fa77891ab73d656_.log 2024-06-26T07:37:24.4463133Z Running 19 items in this shard: test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_adding_tensor_offsets_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_bitwise_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_bmm1_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_cat_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_custom_op_1_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_roi_align_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_scalar_input_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_scaled_dot_product_efficient_attention_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_silu_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_layer_norm_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_linear1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_multi_device_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_multi_threading_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_randint_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_relu_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_scaled_dot_product_efficient_attention_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_silu_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_sort_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_sum_int_cuda_dynamic_shapes_cuda_wrapper 2024-06-26T07:37:24.4479331Z 2024-06-26T07:37:27.3566898Z Running inductor/test_torchinductor_opinfo 13/16 ... [2024-06-26 07:37:27.355900] 2024-06-26T07:37:27.3569869Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'not serial', '--shard-id=13', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:37:27.356278] 2024-06-26T07:37:33.6619241Z 2024-06-26T07:37:33.6621241Z inductor/test_cuda_cpp_wrapper 2/4 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cuda_cpp_wrapper_2.4_df38bdac7176431b_.log 2024-06-26T07:37:33.6636857Z Running 26 items in this shard: test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_buffer_use_after_remove_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_convolution1_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_custom_op_2_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_fft_real_input_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_index_tensor_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_linear_relu_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_mm_views_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_multi_device_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_multi_threading_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_randint_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_reduction1_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_repeat_interleave_2_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::TestCudaWrapper::test_scaled_dot_product_attention_cuda_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_adding_tensor_offsets_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_batch_norm_2d_2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_bernoulli1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_conv_backward_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_convolution1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_custom_op_1_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_custom_op_2_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_custom_op_3_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_fft_real_input_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_fft_real_input_real_output_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_inductor_layout_optimization_input_mutations_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_scalar_input_cuda_dynamic_shapes_cuda_wrapper, test/inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper 2024-06-26T07:37:33.6651936Z 2024-06-26T07:37:36.6056548Z Running inductor/test_torchinductor_opinfo 15/16 ... [2024-06-26 07:37:36.605116] 2024-06-26T07:37:36.6059055Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_opinfo.py', '-m', 'not serial', '--shard-id=15', '--num-shards=16', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:37:36.605454] 2024-06-26T07:44:32.2010679Z 2024-06-26T07:44:32.2014687Z inductor/test_torchinductor_opinfo 15/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_15.16_6404d1a0ec63af14_.log 2024-06-26T07:44:32.2220744Z Running 220 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_T_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive___radd___cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive___ror___cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__batch_norm_with_update_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__unsafe_masked_index_put_accumulate_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__upsample_bilinear2d_aa_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_addcmul_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_addcmul_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_addcmul_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_allclose_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_aminmax_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_angle_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_angle_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argwhere_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_as_strided_partial_views_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_as_strided_scatter_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_asinh_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atan_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atanh_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bfloat16_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bincount_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bitwise_not_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bitwise_right_shift_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_broadcast_tensors_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bucketize_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cartesian_prod_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cartesian_prod_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cauchy_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cdouble_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_chunk_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_max_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_max_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_min_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_combinations_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_conj_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_constant_pad_nd_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_constant_pad_nd_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_constant_pad_nd_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_constant_pad_nd_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_contiguous_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_corrcoef_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cosh_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cumprod_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cumsum_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cumulative_trapezoid_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_deg2rad_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diag_embed_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diff_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_floor_rounding_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_no_rounding_mode_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_dsplit_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_dstack_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_permuted_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_strided_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_eq_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_erfc_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_erfinv_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_hfftn_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifft2_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifft_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifft_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifftshift_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_rfft2_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_rfft_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_rfftn_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_rfftn_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_flatten_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_flatten_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fliplr_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_float_power_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_float_power_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_gather_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ge_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_gt_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_hstack_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_i0_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_int_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isin_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isreal_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_item_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_4inputs_with_extra_args_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_4inputs_with_extra_args_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_binary_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_binary_return_by_ref_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_kron_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_lerp_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_cond_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_cross_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_eigvalsh_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_eigvalsh_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_matrix_norm_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_norm_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_pinv_hermitian_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log1p_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log2_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log2_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logical_and_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logical_and_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logspace_tensor_overload_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_lu_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_lu_solve_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_lu_unpack_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mH_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mT_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_amax_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_argmin_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_cumprod_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_cumprod_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_cumsum_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_binary_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_no_dim_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_no_dim_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_with_dim_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_maximum_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_maximum_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_maximum_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_maximum_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_min_reduction_no_dim_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mode_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_movedim_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nanmean_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_narrow_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_neg_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nextafter_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv1d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_embedding_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_feature_alpha_dropout_without_train_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_grid_sample_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_grid_sample_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_interpolate_area_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_interpolate_bilinear_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_interpolate_nearest-exact_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_local_response_norm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_max_pool3d_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_multilabel_margin_loss_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_multilabel_soft_margin_loss_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_multilabel_soft_margin_loss_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_pairwise_distance_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_pixel_unshuffle_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_pixel_unshuffle_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_prelu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_scaled_dot_product_attention_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_smooth_l1_loss_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_soft_margin_loss_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_soft_margin_loss_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_upsample_nearest_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_norm_inf_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_normal_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_normal_number_mean_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ones_like_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polar_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_3_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_4_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pow_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_prod_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_put_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_rad2deg_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_randint_like_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_randint_like_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_real_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_repeat_interleave_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_as_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resize__cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_roll_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_round_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_round_decimals_0_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_round_decimals_neg_3_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_rsub_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_prod_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_sum_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_bartlett_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_exponential_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signbit_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_airy_ai_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_airy_ai_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_chebyshev_polynomial_t_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_chebyshev_polynomial_u_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_chebyshev_polynomial_v_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_i0e_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_i1_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_laguerre_polynomial_l_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_i0_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_i0_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_i0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_k0_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_k1_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_k1_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_ndtr_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_ndtri_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_ndtri_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_scaled_modified_bessel_k0_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_scaled_modified_bessel_k1_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_spherical_bessel_j0_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_spherical_bessel_j0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_split_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_std_mean_unbiased_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_take_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_tan_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_topk_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_torch_ops_aten__efficient_attention_forward_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_true_divide_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unflatten_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unfold_copy_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unique_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unique_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unsafe_split_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_var_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_var_mean_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_var_mean_unbiased_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_var_unbiased_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_vsplit_cuda_float32 2024-06-26T07:44:32.2374087Z 2024-06-26T07:44:35.0848280Z Running inductor/test_inductor_utils 1/1 ... [2024-06-26 07:44:35.084265] 2024-06-26T07:44:35.0852274Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inductor_utils.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:44:35.084728] 2024-06-26T07:44:37.9050130Z 2024-06-26T07:44:37.9051875Z inductor/test_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inductor_utils_1.1_44e7b565483f6937_.log 2024-06-26T07:44:40.8124166Z 2024-06-26T07:44:40.8125269Z Running inductor/test_inplacing_pass 1/1 ... [2024-06-26 07:44:40.811934] 2024-06-26T07:44:40.8127824Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_inplacing_pass.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:44:40.812254] 2024-06-26T07:44:48.6436419Z 2024-06-26T07:44:48.6438141Z inductor/test_inplacing_pass 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_inplacing_pass_1.1_61eba4467e022075_.log 2024-06-26T07:44:48.6441637Z Running 5 items in this shard: test/inductor/test_inplacing_pass.py::TestReinplacingPassCorrectness::test_dont_modify_input, test/inductor/test_inplacing_pass.py::TestReinplacingPassCorrectness::test_dont_modify_live, test/inductor/test_inplacing_pass.py::TestReinplacingPassCorrectness::test_dont_modify_view_of_live, test/inductor/test_inplacing_pass.py::TestReinplacingPassCorrectness::test_should_modify_inner, test/inductor/test_inplacing_pass.py::TestReinplacingPassCorrectness::test_should_modify_input 2024-06-26T07:44:48.6444347Z 2024-06-26T07:44:51.5816263Z Running inductor/test_config 1/1 ... [2024-06-26 07:44:51.581059] 2024-06-26T07:44:51.5819370Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_config.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:44:51.581385] 2024-06-26T07:44:51.8104164Z 2024-06-26T07:44:51.8105990Z inductor/test_torchinductor_opinfo 13/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_13.16_13ab8c4fc0a42b82_.log 2024-06-26T07:44:51.8229533Z Running 215 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_H_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__native_batch_norm_legit_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__upsample_bilinear2d_aa_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_acosh_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_acosh_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_addcdiv_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_addr_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_amin_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argmin_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argsort_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_as_strided_partial_views_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_asinh_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_asinh_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_asinh_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atan2_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atan_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bitwise_not_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bitwise_xor_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_broadcast_to_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bucketize_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_byte_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_byte_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cdouble_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ceil_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cfloat_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_chunk_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_max_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_min_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_conj_physical_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_count_nonzero_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cumprod_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_no_rounding_mode_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_no_rounding_mode_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_trunc_rounding_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_dot_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_double_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_like_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_permuted_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_strided_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_erfc_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expand_as_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expand_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expm1_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_hfftn_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ihfft_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_irfft_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_irfftn_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_irfftn_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fill_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_flip_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_flip_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_float_power_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_floor_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_floor_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_floor_divide_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fmin_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_gather_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ge_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ge_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_geometric_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_geometric_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_geometric_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_hstack_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_i0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_add_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_add_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_fill_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_fill_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_amax_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_amin_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_mean_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_int_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isfinite_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isneginf_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isreal_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_item_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_binary_return_by_ref_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_le_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_cholesky_ex_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_cross_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_cross_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_diagonal_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_solve_ex_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log10_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log10_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log1p_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log_softmax_with_dtype_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_log_softmax_with_dtype_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logaddexp_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logcumsumexp_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logical_not_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logit_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logspace_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logspace_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logsumexp_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_lt_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mH_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_argmin_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_log_softmax_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_logsumexp_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_mean_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_mean_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_prod_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_prod_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_select_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_std_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_matmul_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_min_binary_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_min_binary_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_min_reduction_with_dim_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_movedim_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_msort_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_msort_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mul_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_multinomial_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nansum_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_narrow_copy_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_native_dropout_backward_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ne_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_new_empty_strided_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_new_zeros_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_binary_cross_entropy_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_binary_cross_entropy_with_logits_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_channel_shuffle_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv2d_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv_transpose1d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_cross_entropy_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_ctc_loss_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_dropout2d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_dropout3d_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_embedding_bag_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_fractional_max_pool3d_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_hardtanh_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_interpolate_bicubic_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_max_pool1d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_pad_replicate_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_relu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_rms_norm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_rms_norm_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_smooth_l1_loss_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_softmin_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_softmin_with_dtype_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_threshold_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_triplet_margin_with_distance_loss_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nonzero_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nonzero_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ones_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ones_like_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_permute_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_prod_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_put_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_randn_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_randn_like_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_renorm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_as_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_as_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resize__cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resize_as__cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resolve_conj_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resolve_conj_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_round_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_round_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_mean_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_mean_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_mean_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_searchsorted_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sign_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_cosine_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_gaussian_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_hamming_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sinh_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_slice_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_slice_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_softmax_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sparse_mm_reduce_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_bessel_y0_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_bessel_y1_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_chebyshev_polynomial_u_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_i0e_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_i1e_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_laguerre_polynomial_l_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_ndtr_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_polygamma_special_polygamma_n_0_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_shifted_chebyshev_polynomial_t_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_split_list_args_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_split_with_sizes_copy_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_split_with_sizes_copy_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_square_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_squeeze_multiple_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_stack_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_std_mean_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_std_mean_unbiased_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sum_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sum_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_svd_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_t_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_tensordot_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_topk_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_tril_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_triu_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_true_divide_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unfold_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unfold_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_uniform_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_view_as_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_vsplit_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_vstack_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_xlogy_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_zero__cuda_int64 2024-06-26T07:44:51.8358017Z 2024-06-26T07:44:54.7153146Z Running inductor/test_cpp_wrapper_hipify 1/1 ... [2024-06-26 07:44:54.714668] 2024-06-26T07:44:54.7155131Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cpp_wrapper_hipify.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:44:54.715002] 2024-06-26T07:44:58.6884452Z 2024-06-26T07:44:58.6886291Z inductor/test_cpp_wrapper_hipify 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cpp_wrapper_hipify_1.1_e9ebae6351a551ef_.log 2024-06-26T07:44:58.6888871Z Running 3 items in this shard: test/inductor/test_cpp_wrapper_hipify.py::TestCppWrapperHipify::test_hipify_aoti_driver_header, test/inductor/test_cpp_wrapper_hipify.py::TestCppWrapperHipify::test_hipify_basic_declaration, test/inductor/test_cpp_wrapper_hipify.py::TestCppWrapperHipify::test_hipify_cross_platform 2024-06-26T07:44:58.6890528Z 2024-06-26T07:45:01.5673771Z Running inductor/test_extension_backend 1/1 ... [2024-06-26 07:45:01.566805] 2024-06-26T07:45:01.5675958Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_extension_backend.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:01.567156] 2024-06-26T07:45:02.1182043Z 2024-06-26T07:45:02.1183492Z inductor/test_config 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_config_1.1_2d3624f5a0e8c289_.log 2024-06-26T07:45:02.1188399Z Running 11 items in this shard: test/inductor/test_config.py::TestInductorConfig::test_api_options, test/inductor/test_config.py::TestInductorConfig::test_compile_api, test/inductor/test_config.py::TestInductorConfig::test_compile_api_passes_config, test/inductor/test_config.py::TestInductorConfig::test_get_compiler_config, test/inductor/test_config.py::TestInductorConfig::test_hasattr, test/inductor/test_config.py::TestInductorConfig::test_invalid_backend, test/inductor/test_config.py::TestInductorConfig::test_invalid_names, test/inductor/test_config.py::TestInductorConfig::test_non_inductor_backend, test/inductor/test_config.py::TestInductorConfig::test_patch, test/inductor/test_config.py::TestInductorConfig::test_save_load, test/inductor/test_config.py::TestInductorConfig::test_set 2024-06-26T07:45:02.1192561Z 2024-06-26T07:45:05.0667827Z Running inductor/test_smoke 1/1 ... [2024-06-26 07:45:05.066093] 2024-06-26T07:45:05.0670060Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_smoke.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:05.066462] 2024-06-26T07:45:08.0376091Z 2024-06-26T07:45:08.0377569Z inductor/test_smoke 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_smoke_1.1_e1fdfda78eda197c_.log 2024-06-26T07:45:08.0378468Z 2024-06-26T07:45:11.0273399Z Running inductor/test_indexing 1/1 ... [2024-06-26 07:45:11.026738] 2024-06-26T07:45:11.0275721Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_indexing.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:11.027097] 2024-06-26T07:45:18.5076869Z 2024-06-26T07:45:18.5078714Z inductor/test_indexing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_indexing_1.1_0201ed5a8c65659e_.log 2024-06-26T07:45:18.5087011Z Running 16 items in this shard: test/inductor/test_indexing.py::TestIndexingSimplification::test_expand_floor_div_applied, test/inductor/test_indexing.py::TestIndexingSimplification::test_expand_floor_div_skipped, test/inductor/test_indexing.py::TestIndexingSimplification::test_indexing_join, test/inductor/test_indexing.py::TestIndexingSimplification::test_indexing_simplification, test/inductor/test_indexing.py::TestIndexingSimplification::test_int8_unpack, test/inductor/test_indexing.py::TestIndexingSimplification::test_modular_indexing_pairs_merged, test/inductor/test_indexing.py::TestIndexingSimplification::test_modular_indexing_pairs_not_merged, test/inductor/test_indexing.py::ExprPrinterTests::test_print_Min_Max, test/inductor/test_indexing.py::ExprPrinterTests::test_print_ceil, test/inductor/test_indexing.py::ExprPrinterTests::test_print_floor, test/inductor/test_indexing.py::ExprPrinterTests::test_print_floor_div, test/inductor/test_indexing.py::ExprPrinterTests::test_print_pow, test/inductor/test_indexing.py::ExprPrinterTests::test_print_round, test/inductor/test_indexing.py::ExprPrinterTests::test_print_round_decimal_ndigits_-1, test/inductor/test_indexing.py::ExprPrinterTests::test_print_round_decimal_ndigits_0, test/inductor/test_indexing.py::ExprPrinterTests::test_print_round_decimal_ndigits_1 2024-06-26T07:45:18.5093726Z 2024-06-26T07:45:21.4205129Z Running inductor/test_group_batch_fusion 1/1 ... [2024-06-26 07:45:21.419930] 2024-06-26T07:45:21.4207611Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_group_batch_fusion.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:21.420244] 2024-06-26T07:45:24.0785762Z 2024-06-26T07:45:24.0787441Z inductor/test_extension_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_extension_backend_1.1_ba5fa1db86d1a95d_.log 2024-06-26T07:45:24.0791599Z Running 1 items in this shard: test/inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration 2024-06-26T07:45:24.0792446Z 2024-06-26T07:45:27.0024180Z Running inductor/test_torchbind 1/1 ... [2024-06-26 07:45:27.001734] 2024-06-26T07:45:27.0027163Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchbind.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:27.002038] 2024-06-26T07:45:36.5366723Z 2024-06-26T07:45:36.5368271Z inductor/test_torchbind 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchbind_1.1_a75708f0844c4663_.log 2024-06-26T07:45:36.5369759Z Running 1 items in this shard: test/inductor/test_torchbind.py::TestTorchbind::test_torchbind_inductor 2024-06-26T07:45:36.5370661Z 2024-06-26T07:45:39.5061767Z Running inductor/test_utils 1/1 ... [2024-06-26 07:45:39.505586] 2024-06-26T07:45:39.5071358Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_utils.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:39.505943] 2024-06-26T07:45:43.0277883Z 2024-06-26T07:45:43.0279599Z inductor/test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_utils_1.1_60246bc6b41dc106_.log 2024-06-26T07:45:43.0281626Z Running 1 items in this shard: test/inductor/test_utils.py::TestUtils::testSympySubs 2024-06-26T07:45:43.0282521Z 2024-06-26T07:45:46.1572353Z Running inductor/test_fx_fusion 1/1 ... [2024-06-26 07:45:46.156676] 2024-06-26T07:45:46.1574619Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_fx_fusion.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:46.156994] 2024-06-26T07:45:50.2308389Z 2024-06-26T07:45:50.2310074Z inductor/test_fx_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_fx_fusion_1.1_f1abefe51a6cfa81_.log 2024-06-26T07:45:50.2312589Z Running 4 items in this shard: test/inductor/test_fx_fusion.py::TestFxFusion::test_linear_permute_fusion, test/inductor/test_fx_fusion.py::TestFxFusion::test_permute_bmm_fusion, test/inductor/test_fx_fusion.py::TestFxFusion::test_permute_linear_fusion, test/inductor/test_fx_fusion.py::TestFxFusion::test_sink_cat_after_pointwise 2024-06-26T07:45:50.2314291Z 2024-06-26T07:45:53.1390106Z Running inductor/test_metrics 1/1 ... [2024-06-26 07:45:53.138488] 2024-06-26T07:45:53.1393072Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_metrics.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:45:53.138817] 2024-06-26T07:45:59.8175566Z 2024-06-26T07:45:59.8177064Z inductor/test_metrics 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_metrics_1.1_c110c5bd840602fa_.log 2024-06-26T07:45:59.8180063Z Running 6 items in this shard: test/inductor/test_metrics.py::TestMetrics::test_atomic_add, test/inductor/test_metrics.py::TestMetrics::test_count_args, test/inductor/test_metrics.py::TestMetrics::test_count_pattern, test/inductor/test_metrics.py::TestMetrics::test_kernel_args_num_gb, test/inductor/test_metrics.py::TestMetrics::test_parse_proper_kernel_fn_code, test/inductor/test_metrics.py::TestMetrics::test_parse_reduction_hint 2024-06-26T07:45:59.8182296Z 2024-06-26T07:46:03.0496918Z Running inductor/test_triton_kernels 1/1 ... [2024-06-26 07:46:03.049127] 2024-06-26T07:46:03.0499503Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_triton_kernels.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:46:03.049519] 2024-06-26T07:46:43.1993634Z 2024-06-26T07:46:43.1995836Z inductor/test_torchinductor_opinfo 4/16 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_opinfo_4.16_5e420b85eaf8f346_.log 2024-06-26T07:46:43.2120464Z Running 228 items in this shard: test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_H_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive___rdiv___cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive___rmod___cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive___rpow___cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__chunk_cat_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__softmax_backward_data_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive__unsafe_masked_index_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_abs_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_acosh_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_add_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_amax_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_amax_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_aminmax_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argmax_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argmax_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_argmin_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_as_strided_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_asin_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atan_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_atleast_2d_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bfloat16_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bfloat16_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_bool_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_broadcast_shapes_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cartesian_prod_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cat_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_chalf_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_char_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_chunk_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_chunk_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_max_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_clamp_min_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_combinations_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_conj_physical_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_constant_pad_nd_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_contiguous_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_copysign_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_copysign_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cos_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cos_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_cosh_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diag_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagflat_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_diagonal_copy_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_div_floor_rounding_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_dot_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_empty_permuted_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp2_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_exp_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expand_as_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expand_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expm1_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_expm1_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_eye_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_fft_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_hfft2_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_hfft2_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_hfft_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifftn_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ifftshift_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ihfft2_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_ihfft2_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fft_irfftn_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fill_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_flatten_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_float_power_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_fmax_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_full_like_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_full_like_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_geometric_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_gt_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_gt_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_histc_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_hypot_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_copy_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_fill_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_fill_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_put_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_amin_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_amin_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_reduce_mean_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_index_select_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isclose_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isfinite_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isinf_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isinf_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isneginf_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isposinf_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_isreal_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_2inputs_2outputs_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_4inputs_with_extra_args_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_binary_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_binary_return_by_ref_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_jiterator_unary_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_kthvalue_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ldexp_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_det_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_diagonal_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_lu_factor_ex_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_pinv_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_solve_triangular_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_tensorinv_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_vander_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_vector_norm_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linspace_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linspace_tensor_overload_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logaddexp2_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logical_and_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_logical_or_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_amax_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_amin_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_argmin_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_argmin_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_argmin_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_cumprod_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_norm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_prod_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_scatter_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_softmin_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_std_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_std_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_sum_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_var_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_masked_var_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_no_dim_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_with_dim_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_max_reduction_with_dim_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_meshgrid_variadic_tensors_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_meshgrid_variadic_tensors_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_meshgrid_variadic_tensors_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_min_reduction_no_dim_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_msort_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_mvlgamma_mvlgamma_p_3_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nanquantile_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ne_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ne_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_neg_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_new_empty_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_new_empty_strided_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_adaptive_avg_pool3d_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool1d_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv1d_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv3d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_conv_transpose2d_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_cross_entropy_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_group_norm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_interpolate_bilinear_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_logsigmoid_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_max_unpool1d_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_max_unpool3d_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_mish_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_mish_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_pixel_shuffle_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_poisson_nll_loss_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_scaled_dot_product_attention_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_silu_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_softshrink_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_threshold_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_threshold_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_unfold_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_upsample_bilinear_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nonzero_static_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_norm_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_norm_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_norm_inf_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_normal_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_normal_in_place_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_outer_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_1_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_2_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_3_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_polygamma_polygamma_n_4_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_prod_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_put_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_put_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_ravel_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_real_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_repeat_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_repeat_interleave_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_repeat_interleave_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_reshape_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resolve_conj_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_resolve_neg_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_rot90_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_rsqrt_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scalar_tensor_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_scatter_reduce_prod_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_select_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_short_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_short_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sign_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_bartlett_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_general_cosine_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signal_windows_hamming_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_signbit_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sinc_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_bessel_y0_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_chebyshev_polynomial_u_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_erfcx_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_erfcx_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_laguerre_polynomial_l_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_k0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_modified_bessel_k1_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_polygamma_special_polygamma_n_0_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_scaled_modified_bessel_k0_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_scaled_modified_bessel_k0_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_shifted_chebyshev_polynomial_u_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_special_spherical_bessel_j0_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_split_list_args_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_sqrt_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_squeeze_cuda_int32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_stack_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_t_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_tile_cuda_int64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_to_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_triangular_solve_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_tril_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unfold_copy_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unfold_cuda_uint8, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unique_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_unsafe_split_cuda_float64, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_view_cuda_bool, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_xlogy_cuda_float16, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_xlogy_cuda_float32, test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_zeros_cuda_bool 2024-06-26T07:46:43.2239137Z 2024-06-26T07:46:46.2115356Z Running inductor/test_foreach 1/1 ... [2024-06-26 07:46:46.211001] 2024-06-26T07:46:46.2117602Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_foreach.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:46:46.211306] 2024-06-26T07:46:48.5844770Z 2024-06-26T07:46:48.5846256Z inductor/test_group_batch_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_group_batch_fusion_1.1_8cfd52ff0c78f475_.log 2024-06-26T07:46:48.5855401Z Running 11 items in this shard: test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_batch_layer_norm_fusion, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_batch_linear_lhs_fusion, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_batch_linear_pre_grad_fusion, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_gate_fusion_post_grad, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_group_linear_fusion, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_group_linear_fusion_different_shapes, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_pointwise_op_fusion, test/inductor/test_group_batch_fusion.py::TestGroupBatchFusion::test_pointwise_op_fusion_post_grad, test/inductor/test_group_batch_fusion.py::TestPostGradBatchLinearFusion::test_batch_linear_post_grad_fusion, test/inductor/test_group_batch_fusion.py::TestFindIndependentSubsetGreedy::test_find_independent_subset_greedy, test/inductor/test_group_batch_fusion.py::TestFindIndependentSubsetGreedy::test_find_independent_subset_greedy_fuse 2024-06-26T07:46:48.5864354Z 2024-06-26T07:46:51.6329123Z Running inductor/test_halide 1/1 ... [2024-06-26 07:46:51.632343] 2024-06-26T07:46:51.6331788Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_halide.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:46:51.632763] 2024-06-26T07:46:56.5549130Z 2024-06-26T07:46:56.5550755Z inductor/test_halide 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_halide_1.1_843967e48ebe8469_.log 2024-06-26T07:46:56.5551962Z 2024-06-26T07:46:59.6348643Z Running test_ops 2/7 ... [2024-06-26 07:46:59.634281] 2024-06-26T07:46:59.6351427Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops.py', '-m', 'not serial', '--shard-id=2', '--num-shards=7', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:46:59.634602] 2024-06-26T07:47:24.2332495Z 2024-06-26T07:47:24.2334194Z inductor/test_triton_kernels 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_triton_kernels_1.1_b31a980d81377bd5_.log 2024-06-26T07:47:24.2432746Z Running 181 items in this shard: test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_False_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_False_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_2d_autotune_grad_True_dynamic_True_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_False_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_False_dynamic_True_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_False_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_aot_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_eager_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_eager_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_eager_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_inductor_grid_type_1, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_inductor_grid_type_2, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_autotune_grad_True_dynamic_True_backend_inductor_grid_type_3, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_caching, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_caching_duplicate, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_constants, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_dependancies, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_different_shapes_size_16_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_different_shapes_size_16_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_different_shapes_size_4_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_different_shapes_size_4_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_equal_to_1_arg_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_equal_to_1_arg_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_equal_to_1_float_arg_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_equal_to_1_float_arg_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_fallback, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_functionalize, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_higher_order_func, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_inner_triton_function_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_inner_triton_function_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_inner_triton_function_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_inputs_buffer_reuse, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_matmul_tracking, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_multi_kernel_grad_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_multi_kernel_grad_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_mutation_not_mark_dirty, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_mutation_type, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_False_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_False_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_False_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_True_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_True_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_False_dynamic_True_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_False_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_False_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_False_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_True_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_True_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_native_grad_True_dynamic_True_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_no_clones_grad_False_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_no_clones_grad_False_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_no_clones_grad_True_dynamic_False, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_no_clones_grad_True_dynamic_True, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_num_ctas_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_num_ctas_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_num_ctas_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_out_of_order, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_reinplace_inplaceable_pass, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_reset_to_zero, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_slice_and_view_input, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_with_autotune_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_with_autotune_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_with_autotune_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_without_autotune_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_without_autotune_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_special_kwargs_without_autotune_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_strided_input, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_strided_input_nonzero_offset, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_False_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_False_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_False_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_True_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_True_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_triton_dtype_dynamic_True_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_unbacked_shape_tensor_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_unbacked_shape_tensor_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_unbacked_shape_tensor_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_various_args, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn0_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn0_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn0_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn1_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn1_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_grad_option_grad_fn1_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_imported_symbol, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_imported_symbol_with_custom_name, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_kernel_param, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_False_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_False_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_False_backend_inductor, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_True_backend_aot_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_True_backend_eager, test/inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_with_views_dynamic_True_backend_inductor, test/inductor/test_triton_kernels.py::MutationTests::test_add_for_loop, test/inductor/test_triton_kernels.py::MutationTests::test_add_for_loop2, test/inductor/test_triton_kernels.py::MutationTests::test_add_nested_for_loop, test/inductor/test_triton_kernels.py::MutationTests::test_add_nested_for_loop_multi_return, test/inductor/test_triton_kernels.py::MutationTests::test_argmax, test/inductor/test_triton_kernels.py::MutationTests::test_cumsum, test/inductor/test_triton_kernels.py::MutationTests::test_fn_call_multi_return, test/inductor/test_triton_kernels.py::MutationTests::test_fn_call_one_return, test/inductor/test_triton_kernels.py::MutationTests::test_for_loop_arg, test/inductor/test_triton_kernels.py::MutationTests::test_for_loop_arg_2, test/inductor/test_triton_kernels.py::MutationTests::test_labels, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_add_4_times_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_add_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_add_kernel_2d_autotuned, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_add_kernel_with_block_ptr, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_add_kernel_with_import, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_atomic_add_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_cond_op_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_indirection_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_indirection_kernel1, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_inline_asm_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_kernel_with_block_ptr_2d, test/inductor/test_triton_kernels.py::MutationTests::test_mutations_mul2_inplace_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_nested_cond_op_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_out_of_order_kernel, test/inductor/test_triton_kernels.py::MutationTests::test_out_of_order_kernel_call, test/inductor/test_triton_kernels.py::MutationTests::test_reduce_sum, test/inductor/test_triton_kernels.py::MutationTests::test_triton_kernel_inference_mode, test/inductor/test_triton_kernels.py::MutationTests::test_while_loop 2024-06-26T07:47:24.2528118Z 2024-06-26T07:47:27.0858073Z Running test_sparse_csr 2/2 ... [2024-06-26 07:47:27.085285] 2024-06-26T07:47:27.0860241Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sparse_csr.py', '-m', 'not serial', '--shard-id=2', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:47:27.085598] 2024-06-26T07:49:42.7693054Z 2024-06-26T07:49:42.7696517Z inductor/test_foreach 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_foreach_1.1_9969b32df902293a_.log 2024-06-26T07:49:42.7782601Z Running 200 items in this shard: test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_2d_blocking_partitioning_elems__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_aliasing, test/inductor/test_foreach.py::ForeachTests::test_broadcasting__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_broadcasting__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_broadcasting__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_broadcasting__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_cpu_cpp_fallback__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_decomp__foreach_addcdiv, test/inductor/test_foreach.py::ForeachTests::test_decomp__foreach_addcmul, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_dynamic_shapes_fallback__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_foreach_cpp_wrapper_cuda, test/inductor/test_foreach.py::ForeachTests::test_fuse_concat, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_fusion_duplicate_buffer_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_kernel_split_arg_limit_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_multi_device, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_producer_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_consumer_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_non_foreach_producer_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_reinplacing__foreach_add_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing__foreach_div_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing__foreach_mul_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing__foreach_sub_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_after__foreach_add_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_after__foreach_div_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_after__foreach_mul_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_after__foreach_sub_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_before__foreach_add_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_before__foreach_div_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_before__foreach_mul_, test/inductor/test_foreach.py::ForeachTests::test_reinplacing_mut_before__foreach_sub_, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_scheduler_fusion_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_single_list__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_single_scalar__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_single_scalar__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_single_scalar__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_single_scalar__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_single_scalar_tensor__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_single_scalar_tensor__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_abs, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_neg, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_reciprocal, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_sign, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_sqrt, test/inductor/test_foreach.py::ForeachTests::test_singleton_lists__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_add, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_clamp_max, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_clamp_min, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_copy, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_div, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_maximum, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_minimum, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_mul, test/inductor/test_foreach.py::ForeachTests::test_type_promotion__foreach_sub, test/inductor/test_foreach.py::ForeachTests::test_zero_elems 2024-06-26T07:49:42.7867721Z 2024-06-26T07:49:45.6493225Z Running test_foreach 1/1 ... [2024-06-26 07:49:45.648699] 2024-06-26T07:49:45.6495116Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_foreach.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:49:45.649029] 2024-06-26T07:56:02.0642110Z 2024-06-26T07:56:02.0645423Z test_foreach 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_foreach_1.1_3c3369c72938ff4f_.log 2024-06-26T07:56:02.2249401Z Running 3463 items in this shard: test/test_foreach.py::TestForeachCUDA::test_0dim_tensor_overload_cpu_ok_cuda, test/test_foreach.py::TestForeachCUDA::test_0dim_tensor_overload_exception_cuda, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_add_scalar_with_empty_list_and_empty_tensor_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_abs_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_acos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_addcdiv_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_addcmul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_asin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_atan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_ceil_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_cos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_cosh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_erf_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_erfc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_exp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_expm1_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_floor_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_frac_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_lerp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_lgamma_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_log10_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_log1p_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_log2_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_log_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_neg_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_norm_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_reciprocal_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_round_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sigmoid_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sign_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sinh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sqrt_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_tan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_tanh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_trunc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_all_zero_size_tensors_do_not_launch_kernel__foreach_zero_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_abs_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_abs_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_abs_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_abs_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_acos_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_acos_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_acos_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_acos_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_add_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_add_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_add_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_add_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcdiv_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcdiv_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcdiv_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcdiv_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcmul_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcmul_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcmul_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_addcmul_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_asin_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_asin_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_asin_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_asin_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_atan_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_atan_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_atan_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_atan_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_ceil_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_ceil_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_ceil_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_ceil_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_max_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_max_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_max_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_max_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_min_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_min_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_min_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_clamp_min_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_copy_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_copy_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_copy_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_copy_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cos_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cos_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cos_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cos_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cosh_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cosh_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cosh_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_cosh_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_div_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_div_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_div_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_div_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erf_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erf_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erf_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erf_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erfc_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erfc_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erfc_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_erfc_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_exp_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_exp_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_exp_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_exp_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_expm1_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_expm1_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_expm1_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_expm1_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_floor_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_floor_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_floor_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_floor_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_frac_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_frac_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_frac_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_frac_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lerp_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lerp_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lerp_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lerp_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lgamma_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lgamma_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lgamma_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_lgamma_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log10_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log10_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log10_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log10_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log1p_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log1p_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log1p_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log1p_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log2_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log2_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log2_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log2_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_log_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_max_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_max_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_max_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_max_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_maximum_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_maximum_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_maximum_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_maximum_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_minimum_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_minimum_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_minimum_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_minimum_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_mul_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_mul_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_mul_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_mul_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_neg_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_neg_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_neg_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_neg_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_norm_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_norm_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_norm_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_norm_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_pow_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_pow_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_pow_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_pow_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_reciprocal_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_reciprocal_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_reciprocal_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_reciprocal_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_round_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_round_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_round_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_round_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sigmoid_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sigmoid_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sigmoid_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sigmoid_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sign_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sign_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sign_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sign_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sin_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sin_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sin_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sin_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sinh_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sinh_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sinh_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sinh_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sqrt_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sqrt_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sqrt_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sqrt_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sub_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sub_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sub_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_sub_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tan_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tan_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tan_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tan_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tanh_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tanh_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tanh_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_tanh_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_trunc_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_trunc_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_trunc_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_trunc_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_zero_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_zero_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_zero_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_autodiff__foreach_zero_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_max_use_cuda_graph_False_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_max_use_cuda_graph_False_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_max_use_cuda_graph_True_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_max_use_cuda_graph_True_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_norm_use_cuda_graph_False_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_norm_use_cuda_graph_False_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_norm_use_cuda_graph_True_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_big_num_tensors__foreach_norm_use_cuda_graph_True_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_min_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_min_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_clamp_min_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_div_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_div_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_div_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_maximum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_maximum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_maximum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_minimum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_minimum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_minimum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_mul_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_mul_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_mul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_pow_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_pow_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_pow_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_sub_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_sub_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_sub_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_min_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_div_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_maximum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_minimum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_mul_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_pow_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_sub_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_add_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_max_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_clamp_min_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_div_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_maximum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_minimum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_mul_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_pow_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_list_slow_path__foreach_sub_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_different_tensor_dtypes__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_add_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_max_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_clamp_min_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_div_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_maximum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_minimum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_mul_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_pow_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_scalar_with_overlapping_tensors__foreach_sub_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_add_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_max_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_clamp_min_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_div_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_maximum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_minimum_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_mul_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_pow_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_tensors_on_different_devices__foreach_sub_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_False_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_binary_op_with_scalar_self_support__foreach_pow_is_fastpath_True_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_div_reciprocal_cuda, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_device_inputs__foreach_copy_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_foreach_copy_with_multi_dtypes__foreach_copy_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_foreach_l2_large_value_input__foreach_norm_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_foreach_l2_large_value_input__foreach_norm_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_max_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_foreach_reduce_large_input__foreach_norm_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_abs_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_acos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_addcdiv_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_addcmul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_asin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_atan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_ceil_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_copy_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_cos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_cosh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_erf_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_erfc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_exp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_expm1_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_floor_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_frac_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_lerp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_lgamma_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_log10_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_log1p_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_log2_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_log_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_neg_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_reciprocal_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_round_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sigmoid_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sign_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sinh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sqrt_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_tan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_tanh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_trunc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_inplace_foreach_leaf_check_and_grad_fn__foreach_zero_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_exp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_expm1_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_reciprocal_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_sigmoid_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_sqrt_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_tan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_lifetime_of_grad_fn_when_result_is_saved__foreach_tanh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_abs_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_acos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_add_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_addcdiv_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_addcmul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_asin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_atan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_ceil_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_clamp_max_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_clamp_min_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_cos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_cosh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_div_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_erf_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_erfc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_exp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_expm1_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_floor_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_frac_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_lerp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_lgamma_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_log10_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_log1p_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_log2_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_log_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_maximum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_minimum_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_mul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_neg_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_pow_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_reciprocal_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_round_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sigmoid_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sign_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sinh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sqrt_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_sub_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_tan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_tanh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_outplace_with_invalid_grads__foreach_trunc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_abs_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_acos_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_add_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcdiv_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_addcmul_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_asin_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_atan_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_ceil_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_max_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_clamp_min_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_copy_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cos_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_cosh_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_div_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erf_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_erfc_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_exp_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_expm1_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_floor_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_frac_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lerp_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_lgamma_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log10_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log1p_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log2_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_log_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_max_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_maximum_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_minimum_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_mul_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_neg_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_norm_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_pow_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_reciprocal_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_round_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sigmoid_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sign_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sin_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sinh_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sqrt_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_sub_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tan_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_tanh_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_trunc_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_fastpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_inplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_parity__foreach_zero_slowpath_outplace_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_tensors_on_different_devices__foreach_addcdiv_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_tensors_on_different_devices__foreach_addcdiv_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_tensors_on_different_devices__foreach_addcmul_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_tensors_on_different_devices__foreach_addcmul_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_False_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcdiv_is_fastpath_True_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_False_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_pointwise_op_with_tensor_of_scalarlist_overload__foreach_addcmul_is_fastpath_True_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_tensors_grouping_cuda, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_abs_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_acos_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_asin_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_atan_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_ceil_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cos_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_cosh_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erf_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_erfc_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_exp_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_expm1_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_floor_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_frac_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_lgamma_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log10_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log1p_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log2_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_log_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_neg_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_reciprocal_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_round_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sigmoid_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sign_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sin_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sinh_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_sqrt_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tan_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_tanh_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_trunc_cuda_uint8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_bfloat16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_bool, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_complex128, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_complex64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_float16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_float32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_float64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_int16, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_int32, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_int64, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_int8, test/test_foreach.py::TestForeachCUDA::test_unary_op_tensors_on_different_devices__foreach_zero_cuda_uint8 2024-06-26T07:56:02.3861450Z 2024-06-26T07:56:04.8868151Z Running test_transformers 1/3 ... [2024-06-26 07:56:04.886272] 2024-06-26T07:56:04.8872151Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_transformers.py', '-m', 'not serial', '--shard-id=1', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '-x', '--reruns=2', '--import-slow-tests', '--import-disabled-tests'] ... [2024-06-26 07:56:04.886627] 2024-06-26T07:58:31.4685791Z 2024-06-26T07:58:31.4687149Z test_ops 2/7 was successful, full logs can be found in artifacts with path test/test-reports/test_ops_2.7_6d7de55c4ddf1d61_.log 2024-06-26T07:58:31.6726788Z Running 4630 items in this shard: test/test_ops.py::TestCommonCUDA::test_compare_cpu___rdiv___cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu___rmul___cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs__conversions_complex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs__conversions_float_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_alias_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_atleast_2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_cauchy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_contiguous_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_diag_embed_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_diagonal_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_div_no_rounding_mode_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_dstack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_linalg_matrix_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_linalg_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_narrow_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_nn_functional_dropout_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_renorm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_rot90_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_stack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__refs_t_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__segment_reduce_lengths_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu__upsample_bilinear2d_aa_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_as_strided_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_atan2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_bucketize_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_byte_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_cholesky_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_chunk_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_cov_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_div_floor_rounding_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_div_trunc_rounding_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_eye_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_flipud_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_hstack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_index_add_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_kthvalue_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_linalg_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_linalg_pinv_singular_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_linalg_solve_ex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_lu_solve_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_lu_unpack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_masked_cumprod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_masked_log_softmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_masked_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_max_reduction_with_dim_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_min_reduction_no_dim_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_new_empty_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_adaptive_max_pool3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_avg_pool1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_avg_pool3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_cross_entropy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_fractional_max_pool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_instance_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_max_unpool3d_grad_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_normalize_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_pad_reflect_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_poisson_nll_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nn_functional_softmin_with_dtype_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_nonzero_static_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_ones_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_pca_lowrank_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_polar_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_randint_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_randn_like_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_reshape_as_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_scatter_add_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_special_chebyshev_polynomial_v_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_special_laguerre_polynomial_l_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_special_shifted_chebyshev_polynomial_t_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_svd_lowrank_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_trace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_trapz_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_tril_cuda_float32, test/test_ops.py::TestCommonCUDA::test_compare_cpu_vstack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_acos_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_atleast_1d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_atleast_3d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_cat_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_conj_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_cos_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_empty_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_empty_permuted_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_eq_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_exp_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_fft_irfft2_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_index_add_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_isinf_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_isreal_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_mT_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_nanmean_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_neg_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_new_zeros_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_randn_like_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_repeat_interleave_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_rsqrt_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_sgn_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_sigmoid_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_stack_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_sum_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_triu_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_complex_half_reference_testing_unfold_copy_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_dtypes___rand___cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__batch_norm_with_update_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_T_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs__conversions_complex_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_abs_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_alias_copy_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_as_strided_copy_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_as_strided_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_as_strided_partial_views_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_atanh_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_atleast_3d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_bucketize_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_ceil_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_conj_physical_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_copysign_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_cumprod_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_diag_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_diagonal_scatter_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_dstack_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_erfinv_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fft_ifft2_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fft_ifftshift_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fft_irfft_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fft_rfft_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fft_rfftn_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_float_power_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_fmax_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_i0_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_index_add_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_isfinite_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_isinf_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_lcm_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_linalg_norm_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_logical_not_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_meshgrid_variadic_tensors_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_new_empty_strided_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_new_full_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_elu_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_hardtanh_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_hinge_embedding_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_margin_ranking_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_poisson_nll_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_nn_functional_smooth_l1_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_prod_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_reshape_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_sin_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_softmax_with_dtype_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_special_entr_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_special_erfcx_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_special_i1e_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_special_logit_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_special_spherical_bessel_j0_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_squeeze_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_take_along_dim_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__refs_unsqueeze_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__softmax_backward_data_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes__unsafe_masked_index_put_accumulate_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_add_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_addr_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_all_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_argwhere_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_as_strided_scatter_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_atanh_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_atleast_2d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_bincount_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_broadcast_to_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_cartesian_prod_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_clamp_max_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_clone_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_conj_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_contiguous_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_cov_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_diff_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_double_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_exp_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_expm1_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_fft_ifft2_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_fft_rfft2_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_fft_rfft_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_fill_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_fmax_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_index_select_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_isnan_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_isreal_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_istft_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_jiterator_binary_return_by_ref_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_kthvalue_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_lgamma_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_ldl_factor_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_lu_factor_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_lu_factor_ex_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_multi_dot_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_svd_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linalg_vector_norm_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_linspace_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_log_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_log_normal_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_logcumsumexp_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_logit_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_logspace_tensor_overload_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_mT_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_masked_fill_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_masked_median_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_masked_var_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_matrix_exp_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_max_binary_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_min_reduction_with_dim_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_minimum_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_movedim_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_msort_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nansum_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_ne_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_new_empty_strided_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_new_ones_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_adaptive_avg_pool1d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_conv_transpose1d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_cross_entropy_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_dropout2d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_dropout_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_glu_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_hardshrink_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_interpolate_area_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_interpolate_linear_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_kl_div_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_l1_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_leaky_relu_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_max_pool2d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_max_unpool1d_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_mish_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_multi_head_attention_forward_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_multilabel_soft_margin_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_normalize_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_one_hot_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_pad_replicate_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_pairwise_distance_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_nn_functional_triplet_margin_with_distance_loss_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_randn_like_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_repeat_interleave_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_reshape_as_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_round_decimals_0_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_scatter_reduce_amax_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_scatter_reduce_prod_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_sigmoid_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_signal_windows_hann_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_sinc_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_chebyshev_polynomial_v_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_hermite_polynomial_h_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_i0e_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_modified_bessel_i0_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_ndtri_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_scaled_modified_bessel_k0_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_special_zeta_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_tanh_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_trapz_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_triu_indices_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_unflatten_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_unfold_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_var_unbiased_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_view_as_cuda, test/test_ops.py::TestCommonCUDA::test_dtypes_vstack_cuda, test/test_ops.py::TestCommonCUDA::test_errors_T_cuda, test/test_ops.py::TestCommonCUDA::test_errors___rmul___cuda, test/test_ops.py::TestCommonCUDA::test_errors___rxor___cuda, test/test_ops.py::TestCommonCUDA::test_errors_add_cuda, test/test_ops.py::TestCommonCUDA::test_errors_amin_cuda, test/test_ops.py::TestCommonCUDA::test_errors_bitwise_xor_cuda, test/test_ops.py::TestCommonCUDA::test_errors_dsplit_cuda, test/test_ops.py::TestCommonCUDA::test_errors_eye_cuda, test/test_ops.py::TestCommonCUDA::test_errors_fft_hfft2_cuda, test/test_ops.py::TestCommonCUDA::test_errors_fft_ifft_cuda, test/test_ops.py::TestCommonCUDA::test_errors_fft_irfftn_cuda, test/test_ops.py::TestCommonCUDA::test_errors_index_add_cuda, test/test_ops.py::TestCommonCUDA::test_errors_lcm_cuda, test/test_ops.py::TestCommonCUDA::test_errors_ldexp_cuda, test/test_ops.py::TestCommonCUDA::test_errors_linalg_lstsq_cuda, test/test_ops.py::TestCommonCUDA::test_errors_movedim_cuda, test/test_ops.py::TestCommonCUDA::test_errors_native_layer_norm_cuda, test/test_ops.py::TestCommonCUDA::test_errors_nn_functional_group_norm_cuda, test/test_ops.py::TestCommonCUDA::test_errors_nn_functional_rms_norm_cuda, test/test_ops.py::TestCommonCUDA::test_errors_reshape_cuda, test/test_ops.py::TestCommonCUDA::test_errors_signal_windows_general_cosine_cuda, test/test_ops.py::TestCommonCUDA::test_errors_signal_windows_kaiser_cuda, test/test_ops.py::TestCommonCUDA::test_errors_sparse_mul_layout1_cuda, test/test_ops.py::TestCommonCUDA::test_errors_sparse_randn_like_layout0_cuda, test/test_ops.py::TestCommonCUDA::test_errors_sparse_sum_layout0_cuda, test/test_ops.py::TestCommonCUDA::test_errors_special_chebyshev_polynomial_v_cuda, test/test_ops.py::TestCommonCUDA::test_errors_special_xlog1py_cuda, test/test_ops.py::TestCommonCUDA::test_errors_sum_to_size_cuda, test/test_ops.py::TestCommonCUDA::test_errors_trace_cuda, test/test_ops.py::TestCommonCUDA::test_errors_tril_cuda, test/test_ops.py::TestCommonCUDA::test_errors_view_copy_cuda, test/test_ops.py::TestCommonCUDA::test_errors_vstack_cuda, test/test_ops.py::TestCommonCUDA::test_multiple_devices___rmod___cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices__batch_norm_with_update_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices__chunk_cat_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices__upsample_bilinear2d_aa_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_acos_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_acos_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_acosh_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_addbmm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_addcmul_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_any_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_argmax_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_as_strided_partial_views_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_atan_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_bfloat16_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_bool_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_bucketize_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_cdist_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_clamp_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_clamp_max_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_clamp_max_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_clone_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_column_stack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_conj_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_constant_pad_nd_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_contiguous_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_diff_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_dstack_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_empty_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_empty_permuted_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_exp2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_exp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_expm1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_exponential_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_fftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_ifft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_ifft2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_ifftshift_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_irfft2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_irfft_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fft_rfft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_flatten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_floor_divide_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_fmod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_gt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_half_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_histc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_hstack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_i0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_igammac_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_index_select_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_int_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_int_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_kron_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_le_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_lgamma_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_eigh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_lu_factor_ex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_matrix_rank_hermitian_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_multi_dot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_norm_subgradients_at_zero_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linalg_svdvals_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_linspace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_log_normal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_logaddexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_logical_not_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_logical_or_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_mH_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_amax_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_argmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_argmin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_fill_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_log_softmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_normalize_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_masked_std_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_matrix_exp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_min_reduction_with_dim_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_minimum_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_multinomial_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_mv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nanmedian_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_narrow_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_new_empty_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_new_empty_strided_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_new_full_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_new_zeros_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_fractional_max_pool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_fractional_max_pool3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_group_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_linear_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_max_unpool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_mse_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_pad_circular_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_prelu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_relu6_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_scaled_dot_product_attention_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_tanhshrink_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_threshold_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_nn_functional_triplet_margin_with_distance_loss_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_ormqr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_outer_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_pow_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_quantile_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_rad2deg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_randn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_randn_like_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_ravel_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_reshape_as_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_reshape_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_resize__cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_resize_as__cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_resolve_conj_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_rot90_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_round_decimals_neg_3_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_scalar_tensor_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_scatter_reduce_amax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_scatter_reduce_amin_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_signal_windows_hamming_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_signal_windows_nuttall_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_sin_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_slice_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_airy_ai_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_bessel_y0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_bessel_y1_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_hermite_polynomial_h_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_i1e_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_laguerre_polynomial_l_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_log_ndtr_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_modified_bessel_i0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_ndtri_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_polygamma_special_polygamma_n_0_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_shifted_chebyshev_polynomial_v_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_special_xlog1py_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_squeeze_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_squeeze_multiple_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_t_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_take_along_dim_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_tanh_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_to_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_trunc_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_trunc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_unfold_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_unsqueeze_cuda_int64, test/test_ops.py::TestCommonCUDA::test_multiple_devices_var_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_var_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_view_cuda_float32, test/test_ops.py::TestCommonCUDA::test_multiple_devices_where_cuda_int64, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_H_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values__chunk_cat_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_acosh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_angle_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_asinh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_atan2_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_atleast_3d_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_bitwise_and_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_bitwise_xor_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_byte_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_clamp_min_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_combinations_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_cos_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_diag_embed_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_dsplit_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_empty_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_expand_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_fft_fftshift_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_fft_hfftn_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_fft_rfftn_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_i0_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_index_add_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_index_fill_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_ldexp_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_log1p_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_logical_xor_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_masked_mean_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_masked_scatter_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_meshgrid_list_of_tensors_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_min_reduction_with_dim_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_mul_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_nansum_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_narrow_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_ne_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_new_ones_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_new_zeros_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_nonzero_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_nonzero_static_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_put_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_ravel_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_reciprocal_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_reshape_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_resize__cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_roll_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_special_airy_ai_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_special_i0e_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_special_modified_bessel_k1_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_tan_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_to_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_trace_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_transpose_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_view_as_cuda_bool, test/test_ops.py::TestCommonCUDA::test_non_standard_bool_values_zero__cuda_bool, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples___rxor___cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples__chunk_cat_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples__segment_reduce_offsets_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_abs_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_addmm_decomposed_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_alias_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_aminmax_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_any_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_argmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_argsort_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_as_strided_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_atan_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_atleast_1d_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_atleast_2d_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_atleast_2d_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_baddbmm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_bitwise_or_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cat_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cfloat_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cholesky_inverse_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_chunk_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_constant_pad_nd_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cos_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cov_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cov_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cummin_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cumprod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_cumsum_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_diff_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_diff_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_div_trunc_rounding_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_einsum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_empty_strided_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_eq_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_equal_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_erfinv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_exp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_expm1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fft_ifftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fft_ifftn_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fft_ihfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fft_irfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_flip_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_flip_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fliplr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fliplr_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_floor_divide_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_fmin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_frac_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_geqrf_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_gradient_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_half_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_heaviside_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_hsplit_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_hypot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_index_add_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_index_fill_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_index_reduce_amax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_isclose_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_isclose_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_isinf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_isnan_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_item_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_jiterator_2inputs_2outputs_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_jiterator_binary_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_jiterator_binary_return_by_ref_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_kron_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_kthvalue_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_ldexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_ldexp_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_cholesky_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_cholesky_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_cond_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_det_singular_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_diagonal_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_diagonal_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_householder_product_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_ldl_factor_ex_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_lu_factor_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_matrix_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_matrix_rank_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_multi_dot_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_norm_subgradients_at_zero_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_pinv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_pinv_singular_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_qr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_slogdet_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_tensorsolve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_linalg_vander_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_log2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_log_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_logical_not_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_lt_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mH_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_masked_cumsum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_masked_select_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_maximum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_meshgrid_list_of_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_min_binary_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_minimum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_movedim_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mul_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mul_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mvlgamma_mvlgamma_p_3_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_mvlgamma_mvlgamma_p_3_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_native_batch_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_ne_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_neg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_new_full_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_new_zeros_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_avg_pool1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_batch_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_conv_transpose1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_fractional_max_pool3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_interpolate_nearest-exact_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_l1_loss_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_logsigmoid_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_margin_ranking_loss_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_max_pool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_max_unpool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_multilabel_margin_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_pad_replicate_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_pairwise_distance_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_pixel_shuffle_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_pixel_shuffle_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_nn_functional_silu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_norm_fro_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_norm_fro_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_norm_inf_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_ormqr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_outer_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_pinverse_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_polygamma_polygamma_n_0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_positive_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_prod_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_put_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_randint_like_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_randn_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_renorm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_resolve_neg_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_round_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_scatter_reduce_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_sgn_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_sin_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_sinc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_sinh_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_slice_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_slice_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_slice_scatter_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_airy_ai_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_chebyshev_polynomial_v_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_i0e_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_i1e_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_legendre_polynomial_p_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_modified_bessel_k1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_scaled_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_shifted_chebyshev_polynomial_u_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_shifted_chebyshev_polynomial_v_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_shifted_chebyshev_polynomial_w_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_special_spherical_bessel_j0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_split_with_sizes_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_split_with_sizes_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_split_with_sizes_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_squeeze_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_stack_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_std_mean_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_std_unbiased_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_t_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_t_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_take_along_dim_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_take_along_dim_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_take_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_tan_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_tensor_split_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_to_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_topk_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_trace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_trapz_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unbind_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unfold_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unfold_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unfold_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_uniform_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unsafe_chunk_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_unsqueeze_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_var_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_var_unbiased_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_vdot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_view_cuda_int64, test/test_ops.py::TestCommonCUDA::test_noncontiguous_samples_zeros_cuda_float32, test/test_ops.py::TestCommonCUDA::test_numpy_ref_aminmax_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_argwhere_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_broadcast_tensors_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_diagflat_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_diagflat_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_equal_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_linalg_vander_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_nn_functional_l1_loss_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_numpy_ref_nn_functional_pairwise_distance_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_permute_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_ravel_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_searchsorted_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_signal_windows_general_cosine_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_signal_windows_nuttall_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_tensor_split_cuda_int64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_unbind_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_view_copy_cuda_float64, test/test_ops.py::TestCommonCUDA::test_numpy_ref_where_cuda_int64, test/test_ops.py::TestCommonCUDA::test_out___rand___cuda_int64, test/test_ops.py::TestCommonCUDA::test_out___rmul___cuda_float32, test/test_ops.py::TestCommonCUDA::test_out___ror___cuda_int64, test/test_ops.py::TestCommonCUDA::test_out__refs_T_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs__conversions_char_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_as_strided_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_atanh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_atleast_1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_bucketize_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_cauchy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_constant_pad_nd_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_cosh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_erf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_exp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_fft_fft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_fft_ihfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_flatten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_heaviside_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_imag_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out__refs_istft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out__refs_linspace_tensor_overload_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_logspace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_masked_fill_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_meshgrid_list_of_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_new_full_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_nn_functional_huber_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_nn_functional_l1_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_nn_functional_softmax_with_dtype_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_normal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_rad2deg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_special_spherical_bessel_j0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_special_zeta_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_sum_to_size_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_trace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_tril_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__refs_true_divide_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__segment_reduce_lengths_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out__unsafe_masked_index_put_accumulate_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_addcdiv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_addr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_alias_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_as_strided_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_asinh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_atleast_2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_atleast_3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_bitwise_and_cuda_int64, test/test_ops.py::TestCommonCUDA::test_out_bitwise_or_cuda_int64, test/test_ops.py::TestCommonCUDA::test_out_bitwise_xor_cuda_int64, test/test_ops.py::TestCommonCUDA::test_out_block_diag_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_clamp_min_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_combinations_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_complex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_contiguous_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_corrcoef_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_cov_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_cummin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_deg2rad_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_empty_like_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_fft_fft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_fft_ifft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_fft_rfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_gcd_cuda_int64, test/test_ops.py::TestCommonCUDA::test_out_geqrf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_half_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_histc_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_hypot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_igammac_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_index_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_index_reduce_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_jiterator_2inputs_2outputs_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_jiterator_4inputs_with_extra_args_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_jiterator_binary_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_det_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_inv_ex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_ldl_solve_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_matrix_rank_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_pinv_singular_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_linalg_solve_triangular_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_log10_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_logaddexp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_lu_unpack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_mH_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_masked_softmin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_matrix_exp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nan_to_num_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_cosine_embedding_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_gelu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_huber_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_local_response_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_max_pool1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_max_pool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_max_pool3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_nll_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_pad_constant_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_rms_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_soft_margin_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_nn_functional_upsample_nearest_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_polygamma_polygamma_n_4_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_quantile_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_acosh_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_addmm_decomposed_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_as_strided_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_baddbmm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_cholesky_solve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_clamp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_cross_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_digamma_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_dot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_fft_hfft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_fft_hfft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_fft_irfft2_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_fft_irfftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_fft_rfftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_full_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_hstack_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_hypot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_index_reduce_amin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_index_reduce_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_index_select_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_ldexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_lerp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_det_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_eigh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_inv_ex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_lu_factor_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_lu_solve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_multi_dot_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_tensorinv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_linalg_vecdot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_log_softmax_with_dtype_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_logaddexp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_logspace_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_logspace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_lu_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_lu_solve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_max_binary_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_mv_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_nansum_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_ormqr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_outer_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_polar_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_polygamma_polygamma_n_0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_quantile_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_sigmoid_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_stack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_sub_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_take_along_dim_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_tanh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_triangular_solve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_tril_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_triu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_unfold_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_view_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_vstack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_requires_grad_error_zeros_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_out_rsub_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_sigmoid_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_signal_windows_nuttall_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_slice_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_special_log_ndtr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_special_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_special_scaled_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_special_scaled_modified_bessel_k1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_stack_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_svd_lowrank_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_trapezoid_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_unsqueeze_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_var_unbiased_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_vsplit_cuda_float32, test/test_ops.py::TestCommonCUDA::test_out_warning__native_batch_norm_legit_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs__conversions_byte_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs__conversions_complex_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs__conversions_half_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs__conversions_long_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs__conversions_short_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_acos_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_add_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_alias_copy_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_amax_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_amin_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_asin_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_bitwise_and_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_block_diag_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_broadcast_to_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_clamp_min_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_cosh_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_diagonal_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_exp_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_exponential_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_fft_ifftn_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_fft_ifftshift_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_fft_ihfft_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_fft_irfft2_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_ge_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_hstack_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_imag_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_linalg_diagonal_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_linalg_matrix_norm_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_logical_and_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_logspace_tensor_overload_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_nn_functional_log_softmax_with_dtype_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_nn_functional_softshrink_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_rad2deg_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_ravel_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_repeat_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_round_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_rsqrt_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_special_bessel_j1_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_special_i1e_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_special_log_softmax_with_dtype_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_std_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_std_mean_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_sub_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_triu_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_unfold_copy_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning__refs_vdot_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_add_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_addcdiv_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_addr_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_amin_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_arange_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_argmax_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_atleast_3d_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_baddbmm_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_bernoulli_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_bincount_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_bitwise_and_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_bitwise_or_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_byte_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_clamp_max_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_clamp_min_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_column_stack_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_conj_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_corrcoef_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_cov_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_diag_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_diagflat_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_diagonal_scatter_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_div_floor_rounding_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_div_no_rounding_mode_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_double_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_eq_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_expand_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_fft_ifftn_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_fft_irfft_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_fft_rfft_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_floor_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_fmax_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_heaviside_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_histogramdd_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_i0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_imag_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_isreal_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_linalg_cross_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_linalg_vander_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_log1p_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_log2_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_logspace_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_long_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_lu_solve_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_masked_argmax_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_masked_prod_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_min_binary_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_mvlgamma_mvlgamma_p_3_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nan_to_num_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_avg_pool2d_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_bilinear_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_binary_cross_entropy_with_logits_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_ctc_loss_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_interpolate_nearest-exact_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_leaky_relu_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_max_pool1d_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_max_unpool2d_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_rrelu_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_softmin_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_nn_functional_softplus_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_ones_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_pca_lowrank_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_pinverse_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_polygamma_polygamma_n_0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_positive_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_qr_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_repeat_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_rot90_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_round_decimals_0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_round_decimals_3_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_rsub_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_scalar_tensor_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_signal_windows_general_hamming_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_signbit_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_sin_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_airy_ai_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_bessel_j0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_laguerre_polynomial_l_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_modified_bessel_i1_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_modified_bessel_k0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_modified_bessel_k1_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_scaled_modified_bessel_k0_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_special_shifted_chebyshev_polynomial_w_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_split_with_sizes_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_sum_to_size_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_triu_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_unflatten_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_uniform_cuda, test/test_ops.py::TestCommonCUDA::test_out_warning_zero__cuda, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float___rdiv___cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_atan2_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_atanh_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_copysign_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_cos_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_cos_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_cosh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_deg2rad_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_erf_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_erfc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_erfinv_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_erfinv_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_exp2_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_exp2_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_exp_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_float_power_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_float_power_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_ldexp_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_lgamma_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_log10_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_log2_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_log2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_logit_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_mean_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_mean_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_mean_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_std_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_std_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_masked_var_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_mvlgamma_mvlgamma_p_3_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_polygamma_polygamma_n_2_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_polygamma_polygamma_n_3_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_polygamma_polygamma_n_3_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_polygamma_polygamma_n_4_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_rsqrt_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sigmoid_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sin_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sinc_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sinc_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sinc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sinh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sinh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_chebyshev_polynomial_t_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_chebyshev_polynomial_w_cuda_bool, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_chebyshev_polynomial_w_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_chebyshev_polynomial_w_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_laguerre_polynomial_l_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_shifted_chebyshev_polynomial_t_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_shifted_chebyshev_polynomial_u_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_shifted_chebyshev_polynomial_v_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_xlog1py_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_xlog1py_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_special_zeta_cuda_int16, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sqrt_cuda_int32, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_sqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_tan_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_tanh_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_xlogy_cuda_int8, test/test_ops.py::TestCommonCUDA::test_promotes_int_to_float_xlogy_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_T_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_T_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_T_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_bfloat16_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_bfloat16_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_bfloat16_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_bool_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_bool_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_byte_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_byte_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_byte_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_cdouble_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_cdouble_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_cfloat_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_cfloat_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_cfloat_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_chalf_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_chalf_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_chalf_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_chalf_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_char_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_char_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_char_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_char_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_char_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_double_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_double_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_double_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_double_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_double_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_float_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_float_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_half_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_half_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_int_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_int_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_int_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_int_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_long_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_short_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs__conversions_short_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_abs_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_abs_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_abs_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_acos_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_acosh_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_add_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_addcmul_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_addcmul_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_alias_copy_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_all_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_all_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_all_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_allclose_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_amax_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_amin_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_any_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_any_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_any_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_arange_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_arange_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_copy_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_copy_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_copy_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_partial_views_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_scatter_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_as_strided_scatter_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_asin_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_asinh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atan2_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atan_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atan_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atanh_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atleast_1d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atleast_1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atleast_3d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_atleast_3d_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_and_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_not_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_or_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_right_shift_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_right_shift_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bitwise_xor_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_block_diag_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_broadcast_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_broadcast_tensors_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_broadcast_tensors_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_broadcast_to_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_broadcast_to_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bucketize_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bucketize_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_bucketize_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cat_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cat_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_chunk_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_chunk_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_chunk_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clamp_max_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clamp_max_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clamp_min_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clone_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clone_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clone_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_clone_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_conj_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_conj_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_conj_physical_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_conj_physical_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_contiguous_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_contiguous_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_copysign_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cos_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cos_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cosh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_count_nonzero_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cumprod_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cumprod_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cumsum_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cumsum_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_cumsum_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_deg2rad_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diag_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diag_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diag_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diag_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diagonal_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diagonal_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diagonal_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diagonal_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_diagonal_scatter_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_digamma_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_floor_rounding_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_floor_rounding_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_floor_rounding_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_no_rounding_mode_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_no_rounding_mode_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_trunc_rounding_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_div_trunc_rounding_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_dsplit_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_dstack_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_empty_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_empty_like_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_empty_strided_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_empty_strided_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_empty_strided_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_eq_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_equal_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_equal_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_equal_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_equal_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfc_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfc_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfc_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfinv_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfinv_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_erfinv_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_exp2_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_exp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_exp_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_exp_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_as_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_as_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expand_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_expm1_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_exponential_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_eye_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_eye_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fft_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fftn_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fftn_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_fftshift_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_hfft2_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_hfft_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_hfftn_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_ifft2_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_ifft2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_ifft_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_ifftn_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_ihfftn_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_irfft2_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_irfft_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_irfft_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_irfft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfft2_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfft_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfft_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfftn_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfftn_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fft_rfftn_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fill_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fill_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fill_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_flatten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fliplr_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fliplr_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_flipud_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_flipud_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_floor_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_floor_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_floor_divide_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fmax_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_fmax_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_gcd_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_gcd_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ge_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_heaviside_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_heaviside_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_heaviside_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_hsplit_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_hsplit_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_i0_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_i0_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_add_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_copy_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_fill_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_fill_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_fill_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_fill_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_select_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_select_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_index_select_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isclose_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isclose_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isclose_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isfinite_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isinf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isnan_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isnan_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isnan_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isnan_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isnan_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isneginf_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isneginf_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isposinf_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isposinf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_isposinf_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_item_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_item_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_le_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_le_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lerp_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lerp_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lgamma_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lgamma_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_cross_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_cross_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_diagonal_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_diagonal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_norm_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linalg_vecdot_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linspace_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linspace_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linspace_tensor_overload_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_linspace_tensor_overload_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_log10_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_log10_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_log2_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_log2_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_log_normal_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_and_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_and_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_and_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_not_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_not_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_or_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_or_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_xor_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logical_xor_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logspace_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logspace_tensor_overload_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logsumexp_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logsumexp_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_logsumexp_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lt_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_lt_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_masked_fill_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_list_of_tensors_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_list_of_tensors_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_list_of_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_list_of_tensors_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_list_of_tensors_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_variadic_tensors_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_meshgrid_variadic_tensors_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_movedim_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_mul_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nan_to_num_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nan_to_num_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_narrow_copy_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_narrow_copy_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_narrow_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ne_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ne_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ne_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_neg_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_new_empty_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_new_empty_strided_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_new_zeros_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_new_zeros_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_new_zeros_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nextafter_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_celu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_celu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_dropout_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_dropout_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_dropout_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_elu_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_glu_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_hardshrink_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_hinge_embedding_loss_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_layer_norm_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_log_softmax_with_dtype_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_log_softmax_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_nll_loss_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pairwise_distance_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pixel_shuffle_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pixel_shuffle_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pixel_unshuffle_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pixel_unshuffle_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_pixel_unshuffle_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_poisson_nll_loss_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu6_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu6_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu6_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu6_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_relu_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_softmax_with_dtype_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_softmax_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_softmin_with_dtype_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_softshrink_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_tanhshrink_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_tanhshrink_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_tanhshrink_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_threshold_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_nn_functional_triplet_margin_loss_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_normal_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_normal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_normal_number_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ones_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_permute_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_positive_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_positive_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_positive_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_prod_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_prod_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_randn_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_ravel_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_real_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_real_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reciprocal_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_renorm_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_repeat_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_repeat_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_repeat_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_as_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_as_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_as_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_as_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_as_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_reshape_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_roll_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_roll_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_roll_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_roll_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_roll_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rot90_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rot90_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rsqrt_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rsqrt_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rsqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_rsub_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sigmoid_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sigmoid_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sign_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_signbit_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sin_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sin_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sin_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sin_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sinc_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sinc_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sinh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sinh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_bessel_j0_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_bessel_j1_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_bessel_j1_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_entr_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_i0e_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_i0e_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_i0e_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_i1_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_log_ndtr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_log_ndtr_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_log_softmax_with_dtype_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_logit_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_multigammaln_mvlgamma_p_1_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_multigammaln_mvlgamma_p_1_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_multigammaln_mvlgamma_p_3_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_ndtr_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_ndtr_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_ndtri_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_softmax_with_dtype_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_softmax_with_dtype_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_softmax_with_dtype_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_xlog1py_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_xlog1py_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_xlog1py_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_special_zeta_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sqrt_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_square_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_square_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_squeeze_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_squeeze_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_squeeze_multiple_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_squeeze_multiple_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_stack_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_stack_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_stack_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_std_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_std_mean_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sub_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sub_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sum_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_sum_to_size_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_t_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_t_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_take_along_dim_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tan_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tan_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tanh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tanh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tensor_split_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trace_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trace_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trace_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trace_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_transpose_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_transpose_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tril_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_tril_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_triu_indices_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_true_divide_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_true_divide_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trunc_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_trunc_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unbind_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unflatten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unflatten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unfold_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unfold_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unfold_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unfold_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unsqueeze_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_unsqueeze_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_var_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_vdot_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_view_as_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_view_as_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_view_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_view_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_vsplit_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_vstack_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_where_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_where_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_where_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_xlogy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_zeros_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_zeros_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref__refs_zeros_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_T_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_amax_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_as_strided_scatter_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_cat_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_diagonal_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_div_floor_rounding_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_div_trunc_rounding_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_dsplit_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_eq_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_eye_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_fft_fftn_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_fft_ihfft_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_fft_ihfftn_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_float_power_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_hstack_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_igammac_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_index_select_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_logaddexp_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_logical_and_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_logical_xor_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_logspace_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_movedim_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_nn_functional_hinge_embedding_loss_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_reshape_as_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_sum_to_size_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_t_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_unbind_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_view_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_errors__refs_where_cuda, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_T_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bfloat16_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bfloat16_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bfloat16_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bfloat16_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bfloat16_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bool_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_bool_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_byte_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_byte_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cdouble_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cdouble_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cdouble_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cdouble_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cdouble_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_cfloat_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_chalf_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_char_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_double_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_float_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_half_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_half_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_long_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_polar_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_short_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_short_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_short_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs__conversions_short_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_abs_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_abs_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_abs_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acos_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acos_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acos_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acosh_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acosh_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acosh_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_acosh_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_add_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_add_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_add_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_add_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addcdiv_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addcdiv_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addcdiv_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addcmul_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addr_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_addr_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_alias_copy_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_all_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_all_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_amax_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_amax_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_any_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_any_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_any_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_any_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_arange_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_arange_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_copy_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_copy_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_copy_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_partial_views_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_partial_views_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_as_strided_partial_views_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_asin_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_asin_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_asin_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_asin_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_asinh_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_1d_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_1d_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_1d_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_2d_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_3d_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_atleast_3d_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_bitwise_left_shift_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_bitwise_or_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_bitwise_xor_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_block_diag_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_block_diag_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_shapes_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_tensors_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_tensors_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_tensors_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_to_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_broadcast_to_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cat_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_chunk_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_chunk_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clamp_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clamp_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clamp_max_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clamp_min_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clamp_min_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clone_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clone_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clone_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clone_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_clone_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_column_stack_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_column_stack_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_conj_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_conj_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_conj_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_conj_physical_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_constant_pad_nd_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_constant_pad_nd_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_constant_pad_nd_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_contiguous_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_copysign_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_copysign_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cos_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cos_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cos_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cosh_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cumprod_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cumsum_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cumsum_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_cumsum_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_deg2rad_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_deg2rad_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_deg2rad_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_deg2rad_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diag_embed_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diag_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diag_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diag_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diagonal_copy_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diagonal_copy_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diagonal_copy_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diagonal_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_diagonal_scatter_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_digamma_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_div_floor_rounding_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_div_floor_rounding_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_div_no_rounding_mode_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_div_trunc_rounding_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_dot_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_dot_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_dsplit_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_dsplit_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_dstack_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_empty_like_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_empty_like_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_empty_strided_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_erf_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_erfinv_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_erfinv_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_erfinv_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_exp2_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_exp2_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_exp2_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_exp_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_expand_as_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_expm1_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_exponential_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_eye_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_fft2_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_fftshift_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_fftshift_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_fftshift_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_hfftn_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_hfftn_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifft2_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifft2_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifft_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifftn_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifftn_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ifftshift_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfft2_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfft2_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfft_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfft_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfftn_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_ihfftn_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_irfft2_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_irfft2_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_irfft_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_irfft_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_irfftn_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfft2_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfft2_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfft_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfft_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfftn_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fft_rfftn_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fill_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flatten_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flatten_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flatten_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flip_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fliplr_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flipud_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flipud_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_flipud_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_float_power_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_float_power_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_floor_divide_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_floor_divide_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_floor_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_floor_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmax_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmax_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmax_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmin_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmin_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_fmod_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_frac_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_gcd_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ge_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ge_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_geometric_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_gt_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_gt_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_heaviside_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_hsplit_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_hsplit_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_hypot_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_imag_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_index_add_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_index_copy_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_index_copy_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_index_fill_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_index_fill_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isclose_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isclose_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isclose_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isfinite_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isfinite_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isfinite_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isinf_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isnan_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isnan_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isreal_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isreal_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_isreal_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_item_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_item_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_le_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_le_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_lerp_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_lerp_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_lgamma_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_lgamma_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_cross_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_cross_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_cross_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_cross_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_diagonal_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_matrix_norm_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_matrix_norm_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_norm_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_norm_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_svd_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_vecdot_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_vecdot_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_vector_norm_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_linalg_vector_norm_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log1p_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log1p_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log1p_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log2_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_log_softmax_with_dtype_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logaddexp2_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logaddexp_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logical_not_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logical_or_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logical_or_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logspace_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logspace_tensor_overload_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logspace_tensor_overload_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_logsumexp_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_lt_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_masked_fill_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_maximum_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_meshgrid_list_of_tensors_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_meshgrid_variadic_tensors_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_minimum_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_movedim_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_movedim_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_mul_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nan_to_num_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nan_to_num_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_copy_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_copy_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_copy_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_copy_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_narrow_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_native_layer_norm_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ne_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ne_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ne_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_neg_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_neg_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_empty_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_empty_strided_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_empty_strided_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_full_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_full_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_ones_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_zeros_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_new_zeros_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_alpha_dropout_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_celu_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_dropout_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_hardshrink_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_hardtanh_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_hardtanh_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_layer_norm_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_leaky_relu_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_leaky_relu_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_log_softmax_with_dtype_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_log_softmax_with_dtype_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_margin_ranking_loss_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_margin_ranking_loss_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_nll_loss_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pairwise_distance_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pairwise_distance_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_pixel_shuffle_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_prelu_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_relu6_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_smooth_l1_loss_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softmax_with_dtype_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softmax_with_dtype_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softmin_with_dtype_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softmin_with_dtype_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softshrink_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_softshrink_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_threshold_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_triplet_margin_loss_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_triplet_margin_loss_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_nn_functional_triplet_margin_loss_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_normal_number_mean_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ones_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ones_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_ones_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_permute_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_positive_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_positive_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_positive_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_pow_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_pow_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_pow_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_pow_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_prod_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_prod_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rad2deg_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rad2deg_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_randn_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_randn_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reciprocal_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reciprocal_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_remainder_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_remainder_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_renorm_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_renorm_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_renorm_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reshape_as_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reshape_as_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reshape_as_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reshape_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_reshape_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_roll_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_roll_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_roll_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rot90_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_round_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_round_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_round_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rsqrt_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rsqrt_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rsub_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_rsub_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_select_scatter_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sgn_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sgn_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sgn_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sigmoid_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sigmoid_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sign_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_signbit_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_signbit_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_signbit_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_signbit_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sin_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sin_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sinc_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sinc_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sinc_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sinh_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sinh_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_softmax_with_dtype_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_bessel_j0_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_bessel_j0_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_bessel_j1_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_bessel_j1_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_entr_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_entr_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_i1_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_log_softmax_with_dtype_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_logit_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_logit_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_multigammaln_mvlgamma_p_1_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_multigammaln_mvlgamma_p_3_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_multigammaln_mvlgamma_p_3_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_multigammaln_mvlgamma_p_5_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_ndtr_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_ndtr_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_ndtr_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_ndtri_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_spherical_bessel_j0_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_spherical_bessel_j0_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_xlog1py_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_special_zeta_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sqrt_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sqrt_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sqrt_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_square_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_square_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_square_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_squeeze_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_squeeze_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_squeeze_multiple_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_stack_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_std_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_std_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_std_mean_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_stft_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sub_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sum_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sum_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sum_to_size_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_sum_to_size_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_t_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_t_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_t_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_take_along_dim_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_take_along_dim_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tan_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tan_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tan_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tan_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tanh_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tanh_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tanh_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tensor_split_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tensor_split_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tensor_split_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_to_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_to_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_trace_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_transpose_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_transpose_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tril_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_tril_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_triu_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_triu_indices_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_true_divide_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_true_divide_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_trunc_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unbind_executor_aten_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unbind_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unflatten_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unflatten_executor_aten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unfold_copy_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unfold_executor_aten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unfold_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_unfold_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_var_mean_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_vdot_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_view_as_executor_aten_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_view_executor_aten_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_view_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_vsplit_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_vsplit_executor_aten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_vstack_executor_aten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_vstack_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_where_executor_aten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_where_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_xlogy_executor_aten_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_xlogy_executor_aten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_xlogy_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_zeros_executor_aten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_zeros_executor_aten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_executor__refs_zeros_executor_aten_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_T_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_T_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_T_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_T_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_bool_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_cdouble_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_cdouble_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_chalf_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_chalf_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_chalf_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_char_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_char_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_double_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_double_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_float_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_half_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_int_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_long_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_long_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs__conversions_polar_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_abs_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_abs_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_abs_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acos_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acos_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acosh_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acosh_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acosh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_acosh_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_add_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_add_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_add_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_add_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_addr_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_alias_copy_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_alias_copy_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_alias_copy_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_all_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_all_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_all_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_all_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_all_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_allclose_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_allclose_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_allclose_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_amin_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_any_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_any_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_arange_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_copy_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_copy_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_scatter_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_scatter_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_as_strided_scatter_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asin_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asin_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asinh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asinh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asinh_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_asinh_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atan2_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atan2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atan2_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atan_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atanh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atanh_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atanh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atleast_1d_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atleast_1d_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atleast_2d_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atleast_2d_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_atleast_3d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_left_shift_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_or_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_right_shift_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_right_shift_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_xor_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bitwise_xor_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_block_diag_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_block_diag_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_block_diag_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_broadcast_tensors_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_broadcast_tensors_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_broadcast_to_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bucketize_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_bucketize_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cauchy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cauchy_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ceil_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_chunk_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_chunk_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_clamp_max_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_clone_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_clone_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_column_stack_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_column_stack_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_column_stack_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_conj_physical_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_constant_pad_nd_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_constant_pad_nd_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_contiguous_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_contiguous_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_copysign_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_copysign_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_copysign_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cos_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cos_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cos_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cosh_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cosh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cosh_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_count_nonzero_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_count_nonzero_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cumprod_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cumprod_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cumsum_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cumsum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_cumsum_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_deg2rad_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diag_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diag_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diag_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diag_embed_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diag_embed_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diagonal_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diagonal_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_diagonal_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_digamma_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_digamma_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_div_no_rounding_mode_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_div_no_rounding_mode_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_div_no_rounding_mode_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_div_trunc_rounding_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_dsplit_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_dstack_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_dstack_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_like_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_like_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_like_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_like_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_strided_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_empty_strided_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eq_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eq_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eq_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_equal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_erf_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_erfc_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_erfc_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_erfinv_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_exp2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_exp_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_exp_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_exp_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_expand_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_expand_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_expand_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_expm1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_expm1_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eye_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eye_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eye_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_eye_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fft2_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fft2_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fft2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fftn_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fftn_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fftshift_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fftshift_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_fftshift_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_hfft2_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_hfft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_hfft_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ifft_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ifft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ifftn_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ifftshift_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ifftshift_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ihfft2_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ihfft2_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_ihfft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_irfft2_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_irfft_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_irfft_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_irfft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_irfftn_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_rfft2_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_rfft2_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_rfft_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fft_rfft_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fill_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fill_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fill_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fill_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flatten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flatten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flatten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flatten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flip_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fliplr_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_flipud_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_float_power_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_floor_divide_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_floor_divide_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmax_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmax_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmax_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmod_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_fmod_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_frexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_geometric_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_hsplit_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_hstack_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_i0_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_i0_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_igamma_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_index_add_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_index_fill_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_index_fill_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isclose_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isclose_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isfinite_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isinf_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isnan_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isneginf_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isneginf_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isposinf_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isreal_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_isreal_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_istft_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_istft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_item_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_item_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_le_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_le_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_le_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_lerp_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_lerp_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_lerp_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linalg_diagonal_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linalg_matrix_norm_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linalg_matrix_norm_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linalg_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linalg_vector_norm_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linspace_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linspace_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linspace_tensor_overload_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_linspace_tensor_overload_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log10_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log1p_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log1p_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log_normal_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log_softmax_with_dtype_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log_softmax_with_dtype_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_log_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logaddexp2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logaddexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logical_and_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logical_and_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logical_not_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logical_or_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logical_xor_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logspace_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logspace_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logspace_tensor_overload_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logsumexp_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logsumexp_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_logsumexp_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_masked_fill_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_masked_fill_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_maximum_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_maximum_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_meshgrid_list_of_tensors_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_meshgrid_list_of_tensors_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_meshgrid_list_of_tensors_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_meshgrid_variadic_tensors_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_minimum_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_minimum_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_movedim_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_mul_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_mul_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nan_to_num_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nan_to_num_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nan_to_num_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_narrow_copy_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_narrow_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ne_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_neg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_strided_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_strided_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_empty_strided_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_full_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_ones_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_ones_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_new_zeros_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_alpha_dropout_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_celu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_celu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_elu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_hardtanh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_log_softmax_with_dtype_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_log_softmax_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_mse_loss_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_pairwise_distance_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_pixel_shuffle_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_pixel_shuffle_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_pixel_unshuffle_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_prelu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_relu6_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_selu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmax_with_dtype_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmax_with_dtype_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmax_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmin_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softmin_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_softplus_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_threshold_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_nn_functional_triplet_margin_loss_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_normal_number_mean_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ones_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ones_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_permute_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_permute_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_permute_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_permute_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_positive_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_pow_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_prod_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_prod_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_prod_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_prod_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rad2deg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rad2deg_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ravel_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_ravel_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_real_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_real_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reciprocal_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_remainder_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_remainder_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_remainder_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_remainder_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_renorm_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_repeat_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_repeat_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_repeat_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_as_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_as_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_as_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_as_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_reshape_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_roll_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rot90_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rot90_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rot90_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rot90_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_round_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_round_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rsqrt_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_rsqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sgn_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sigmoid_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sigmoid_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sign_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sign_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sign_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_signbit_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_signbit_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sin_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sin_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sin_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sin_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sin_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinc_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinc_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sinh_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_softmax_with_dtype_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_bessel_j0_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_bessel_j1_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_entr_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_entr_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_erfcx_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_erfcx_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_i0e_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_i1_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_i1_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_i1e_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_log_ndtr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_log_softmax_with_dtype_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_logit_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_multigammaln_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_multigammaln_mvlgamma_p_3_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_multigammaln_mvlgamma_p_5_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_multigammaln_mvlgamma_p_5_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_ndtr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_ndtr_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_ndtri_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_softmax_with_dtype_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_softmax_with_dtype_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_spherical_bessel_j0_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_spherical_bessel_j0_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_spherical_bessel_j0_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_xlog1py_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_special_xlog1py_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sqrt_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sqrt_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_square_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_square_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_square_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_squeeze_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_squeeze_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_squeeze_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_squeeze_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_squeeze_multiple_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_std_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_std_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_stft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sum_to_size_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_sum_to_size_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_t_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_t_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tan_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tanh_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tanh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tanh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tanh_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tensor_split_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tensor_split_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_trace_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_transpose_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_transpose_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tril_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_tril_indices_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_triu_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_triu_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_true_divide_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_true_divide_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_trunc_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_trunc_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unbind_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unbind_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unflatten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unflatten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unflatten_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unflatten_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unflatten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_copy_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_copy_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unfold_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unsqueeze_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unsqueeze_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unsqueeze_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unsqueeze_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_unsqueeze_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_var_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_var_mean_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_view_as_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_view_as_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_view_as_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_view_as_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_vsplit_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_vstack_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_where_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_where_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_xlogy_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_meta__refs_zeros_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_bool_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_bool_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_byte_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_cdouble_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_cdouble_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_cfloat_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_char_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_complex_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_double_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_float_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_float_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_half_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_half_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_int_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_int_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_long_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_long_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_long_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_polar_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_short_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs__conversions_short_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_abs_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_acos_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_acosh_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_add_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_addcdiv_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_addcmul_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_addr_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_addr_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_alias_copy_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_alias_copy_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_amax_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_arange_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_copy_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_copy_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_as_strided_scatter_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_asin_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_asinh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_asinh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atan_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atan_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atanh_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atanh_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atanh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_1d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_1d_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_1d_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_2d_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_3d_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_atleast_3d_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_and_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_left_shift_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_not_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_or_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_right_shift_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_bitwise_xor_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_block_diag_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_block_diag_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_broadcast_tensors_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_broadcast_tensors_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_broadcast_tensors_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_broadcast_tensors_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_broadcast_to_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cat_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ceil_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ceil_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ceil_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ceil_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_chunk_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_chunk_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_max_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_min_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clamp_min_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clone_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clone_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clone_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_clone_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_conj_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_conj_physical_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_conj_physical_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_constant_pad_nd_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_constant_pad_nd_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_contiguous_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_contiguous_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_contiguous_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_contiguous_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_copysign_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_copysign_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cos_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cos_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cosh_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_count_nonzero_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_count_nonzero_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumprod_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumprod_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumprod_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumsum_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumsum_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumsum_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_cumsum_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_deg2rad_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diag_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diag_embed_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diagonal_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diagonal_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diagonal_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_diagonal_scatter_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_digamma_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_div_floor_rounding_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_div_floor_rounding_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_div_no_rounding_mode_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_dsplit_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_dstack_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_dstack_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_empty_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_empty_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_empty_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_empty_strided_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_eq_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_eq_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_eq_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_eq_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_equal_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_equal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_equal_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_erf_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_erf_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_erfc_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_expand_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_expand_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_expm1_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_expm1_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_eye_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fft_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fftn_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fftn_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_fftshift_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft2_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft2_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft2_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft2_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfft_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfftn_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_hfftn_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifft2_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifft_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifft_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifft_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftn_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftn_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftn_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftn_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ifftshift_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft2_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfft_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_ihfftn_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfft2_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfft2_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfft_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfftn_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfftn_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fft_irfftn_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fill_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fill_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flatten_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flatten_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flatten_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flatten_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flip_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flip_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fliplr_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flipud_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_flipud_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_float_power_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_floor_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_floor_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_floor_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_floor_divide_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fmax_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fmod_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fmod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fmod_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_fmod_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_gcd_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ge_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ge_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ge_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ge_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_geometric_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_gt_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_heaviside_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_hsplit_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_hstack_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_hstack_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_hstack_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_i0_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_index_add_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_index_add_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_index_copy_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_index_select_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isclose_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isclose_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isfinite_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isfinite_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isfinite_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isinf_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isinf_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isinf_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isinf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isnan_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isposinf_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isposinf_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isreal_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_isreal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_item_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_item_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_le_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_le_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_lerp_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_lerp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_lerp_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_lgamma_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_cross_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_cross_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_cross_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_diagonal_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_diagonal_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_matrix_norm_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_matrix_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_norm_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_vecdot_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_vector_norm_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linalg_vector_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linspace_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linspace_tensor_overload_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_linspace_tensor_overload_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log10_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log1p_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log1p_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log1p_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log2_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_log_softmax_with_dtype_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logaddexp_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logaddexp_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logical_or_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logical_xor_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_tensor_overload_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_tensor_overload_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_tensor_overload_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logspace_tensor_overload_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logsumexp_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logsumexp_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_logsumexp_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_masked_fill_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_maximum_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_maximum_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_maximum_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_meshgrid_list_of_tensors_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_meshgrid_variadic_tensors_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_minimum_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_movedim_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_mul_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_mul_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nan_to_num_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nan_to_num_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nan_to_num_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_narrow_copy_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_narrow_copy_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_narrow_copy_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ne_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ne_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ne_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ne_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_neg_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_empty_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_empty_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_empty_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_empty_strided_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_full_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_ones_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_zeros_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_zeros_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_new_zeros_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_alpha_dropout_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_celu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_elu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_hardshrink_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_l1_loss_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_log_softmax_with_dtype_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_log_softmax_with_dtype_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_log_softmax_with_dtype_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_margin_ranking_loss_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_mse_loss_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pairwise_distance_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pairwise_distance_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pairwise_distance_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pixel_shuffle_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pixel_unshuffle_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_pixel_unshuffle_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_prelu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_relu6_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_selu_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_selu_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_selu_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_softmax_with_dtype_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_softmin_with_dtype_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_softmin_with_dtype_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_tanhshrink_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_nn_functional_threshold_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_norm_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_norm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_normal__in_place_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_normal__in_place_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ones_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_ones_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_permute_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_positive_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_positive_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_positive_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_pow_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_prod_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_prod_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_prod_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_rad2deg_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_randn_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_real_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_reciprocal_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_remainder_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_reshape_as_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_reshape_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_reshape_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_roll_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_round_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_round_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_rsqrt_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_rsqrt_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_rsub_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_rsub_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_select_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sigmoid_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sigmoid_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sign_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_signbit_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sin_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sin_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sinc_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sinc_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_softmax_with_dtype_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_bessel_j0_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_bessel_j0_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_bessel_j1_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_bessel_j1_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_entr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_entr_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_erfcx_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_i1_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_i1e_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_i1e_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_log_softmax_with_dtype_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_multigammaln_mvlgamma_p_3_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_multigammaln_mvlgamma_p_3_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_ndtr_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_ndtr_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_ndtri_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_softmax_with_dtype_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_softmax_with_dtype_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_softmax_with_dtype_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_spherical_bessel_j0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_spherical_bessel_j0_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_spherical_bessel_j0_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_xlog1py_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_zeta_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_zeta_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_zeta_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_special_zeta_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sqrt_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_square_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_squeeze_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_squeeze_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_squeeze_multiple_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_squeeze_multiple_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_stack_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_stack_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_std_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_std_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_std_mean_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_std_mean_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_stft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_stft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sum_cuda_float16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sum_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sum_to_size_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sum_to_size_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_sum_to_size_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_t_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_take_along_dim_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_tanh_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_tanh_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_tanh_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_tensor_split_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_to_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_trace_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_trace_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_transpose_cuda_float64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_tril_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_triu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_triu_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_true_divide_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_trunc_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_trunc_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unbind_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unbind_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unbind_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unflatten_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unflatten_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unfold_copy_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unfold_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unfold_copy_cuda_int8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unsqueeze_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unsqueeze_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unsqueeze_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_unsqueeze_cuda_uint8, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_var_mean_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_var_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_vdot_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_as_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_as_cuda_int32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_as_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_view_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_vsplit_cuda_int64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_vstack_cuda_int16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_where_cuda_bfloat16, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_where_cuda_complex128, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_where_cuda_complex32, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_where_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_xlogy_cuda_bool, test/test_ops.py::TestCommonCUDA::test_python_ref_torch_fallback__refs_xlogy_cuda_int8, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_T_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager___radd___cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager___rdiv___cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager___rmatmul___cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager__segment_reduce_lengths_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager__unsafe_masked_index_put_accumulate_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_abs_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_acosh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_addcdiv_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_addmm_decomposed_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_addr_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_addr_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_any_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_argmin_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_as_strided_partial_views_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_as_strided_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_asinh_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_baddbmm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_broadcast_shapes_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_broadcast_to_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_cartesian_prod_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_conj_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_corrcoef_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_count_nonzero_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_diagflat_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_diagonal_scatter_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_diff_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_div_no_rounding_mode_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_dot_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_einsum_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_empty_like_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_empty_permuted_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_exp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_expand_as_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_eye_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_fft_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_fftshift_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_ifft2_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_ifftn_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_ihfft2_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_fft_irfft_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_flip_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_full_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_gather_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_half_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_heaviside_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_igamma_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_isfinite_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_isneginf_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_jiterator_4inputs_with_extra_args_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_cholesky_ex_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_cross_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_eigh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_eigvals_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_ldl_factor_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_lstsq_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_multi_dot_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_norm_subgradients_at_zero_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_pinv_singular_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_slogdet_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_vander_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_linalg_vector_norm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_log_normal_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_logical_or_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_long_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_lt_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_lu_solve_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_masked_logaddexp_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_masked_mean_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_masked_normalize_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_masked_sum_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_meshgrid_variadic_tensors_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_movedim_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nanquantile_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_narrow_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_narrow_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_ne_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_new_ones_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_adaptive_avg_pool1d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_conv_transpose2d_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_cosine_embedding_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_ctc_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_elu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_gaussian_nll_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_hardshrink_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_l1_loss_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_max_pool2d_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_normalize_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_pad_reflect_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_pairwise_distance_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_pdist_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_pixel_unshuffle_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_selu_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_smooth_l1_loss_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_softshrink_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_softsign_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_tanhshrink_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nn_functional_unfold_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nonzero_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_nonzero_static_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_norm_fro_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_normal_in_place_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_normal_number_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_ones_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_pinverse_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_polygamma_polygamma_n_3_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_positive_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_put_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_rand_like_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_rand_like_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_randint_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_real_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_renorm_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_repeat_interleave_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_reshape_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_resolve_neg_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_roll_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_rot90_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_select_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_select_scatter_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_sigmoid_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_sigmoid_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_signal_windows_blackman_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_signal_windows_kaiser_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_slice_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_sparse_sampled_addmm_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_special_bessel_j0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_special_i1e_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_special_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_special_zeta_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_split_list_args_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_split_with_sizes_copy_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_squeeze_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_std_mean_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_std_mean_unbiased_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_sum_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_to_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_to_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_triangular_solve_cuda_complex64, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_triangular_solve_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_tril_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_unfold_copy_cuda_float32, test/test_ops.py::TestCommonCUDA::test_variant_consistency_eager_unsafe_split_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward___rmod___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward___rpow___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_acosh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_atan2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_broadcast_tensors_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_chalf_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_clamp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_clamp_max_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_conj_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_conj_physical_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_cummin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_digamma_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_div_floor_rounding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_double_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fft_fftshift_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fft_hfft2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fft_hfft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fft_ihfft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fft_irfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fliplr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_fmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_grid_sampler_2d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_hstack_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_cholesky_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_det_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_eig_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_inv_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_lu_factor_ex_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_solve_ex_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_linalg_svd_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_logaddexp2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_logsumexp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_lu_unpack_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_masked_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_movedim_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_mvlgamma_mvlgamma_p_3_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nanmedian_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nanquantile_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_adaptive_avg_pool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_alpha_dropout_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_avg_pool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_avg_pool3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_conv_transpose1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_ctc_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_embedding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_fractional_max_pool3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_gaussian_nll_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_interpolate_bicubic_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_interpolate_trilinear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_max_unpool3d_grad_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_multi_head_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_pad_circular_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_prelu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_rms_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_nn_functional_silu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_ormqr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_polygamma_polygamma_n_2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_pow_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_real_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_reciprocal_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_repeat_interleave_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_scatter_reduce_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_select_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_slice_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_square_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_squeeze_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_squeeze_multiple_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_t_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_tensor_split_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_to_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_trace_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_true_divide_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_var_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_backward_vdot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input___radd___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input___rmatmul___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_amin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_asin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_atleast_2d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_bfloat16_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_bmm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_broadcast_to_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_bucketize_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_byte_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_cdouble_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_ceil_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_cholesky_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_clamp_max_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_copysign_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_cosh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_count_nonzero_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_diag_embed_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_div_floor_rounding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_div_trunc_rounding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_dot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_double_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_einsum_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_fft_hfft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_fft_ihfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_fft_rfft2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_fill_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_flipud_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_floor_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_floor_divide_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_geqrf_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_gradient_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_gt_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_histc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_index_put_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_index_reduce_amin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_index_reduce_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_isin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_isnan_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_linalg_eigvalsh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_linalg_matrix_power_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_linalg_multi_dot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_linalg_solve_ex_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_log_normal_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_logspace_tensor_overload_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_lu_solve_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_masked_median_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_masked_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_masked_scatter_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_masked_softmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_masked_sum_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_max_reduction_with_dim_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_multinomial_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nanmedian_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_ne_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_new_full_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_adaptive_avg_pool3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_adaptive_max_pool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_adaptive_max_pool2d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_batch_norm_without_cudnn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_glu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_interpolate_nearest-exact_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_kl_div_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_margin_ranking_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_multi_head_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_multi_margin_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_nn_functional_silu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_norm_fro_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_normal_number_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_real_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_remainder_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_reshape_as_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_scatter_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_scatter_reduce_amin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_sgn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_sign_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_signal_windows_general_hamming_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_signal_windows_nuttall_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_signbit_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_special_hermite_polynomial_h_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_special_log_ndtr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_special_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_special_zeta_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_split_list_args_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_squeeze_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_sum_to_size_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_tan_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_tile_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_to_sparse_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_trunc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_cow_input_view_copy_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad___rmatmul___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad__chunk_cat_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad__segment_reduce_offsets_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_addcmul_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_addmm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_angle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_asin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_bfloat16_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_cholesky_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_chunk_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_clamp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_complex_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_conj_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_copysign_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_cummin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_cumulative_trapezoid_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_empty_strided_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_erfc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_expand_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fft_hfft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fft_hfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fft_ihfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fft_rfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_flatten_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fliplr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_flipud_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_fmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_full_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_index_reduce_amax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_index_reduce_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_index_select_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_isneginf_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_jiterator_4inputs_with_extra_args_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_kthvalue_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_ldexp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_linalg_multi_dot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_linalg_solve_triangular_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_linalg_tensorinv_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_log_softmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_logaddexp2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_logical_and_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_lu_unpack_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_masked_amin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_masked_median_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_max_binary_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_median_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_meshgrid_list_of_tensors_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nan_to_num_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nanmean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_alpha_dropout_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_bilinear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_channel_shuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_embedding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_hardswish_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_interpolate_bicubic_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_max_unpool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_max_unpool1d_grad_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_max_unpool3d_grad_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_multi_head_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_prelu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_nn_functional_softmin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_normal_number_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_ones_like_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_permute_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_pinverse_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_polygamma_polygamma_n_2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_repeat_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_resolve_conj_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_resolve_neg_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_scatter_reduce_amax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_scatter_reduce_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_searchsorted_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_select_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_signal_windows_exponential_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_signal_windows_hann_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_sin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_sinc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_hermite_polynomial_h_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_i1e_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_legendre_polynomial_p_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_modified_bessel_k1_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_scaled_modified_bessel_k1_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_special_shifted_chebyshev_polynomial_u_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_sqrt_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_sum_to_size_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_svd_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_transpose_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_trapezoid_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_uniform_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_forward_ad_zeros_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_H_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator___rsub___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator__unsafe_masked_index_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator__unsafe_masked_index_put_accumulate_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_addmv_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_addr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_any_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_as_strided_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_atanh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_byte_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_cartesian_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_combinations_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_diag_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_diff_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_dist_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_div_no_rounding_mode_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_dstack_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_empty_strided_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fft_hfft2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fft_ifftshift_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fft_ihfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fft_rfft2_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fft_rfft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_fill_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_full_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_gather_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_gradient_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_half_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_isposinf_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_jiterator_4inputs_with_extra_args_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linalg_cross_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linalg_eig_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linalg_lstsq_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linalg_lu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linalg_tensorinv_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_linspace_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_log10_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_log_softmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_logical_or_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_logical_xor_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_logspace_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_logspace_tensor_overload_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_lu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_cumprod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_log_softmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_scatter_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_select_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_masked_sum_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_max_reduction_no_dim_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nan_to_num_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_narrow_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_ne_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_new_full_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_conv_transpose2d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_conv_transpose3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_ctc_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_gelu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_group_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_hardshrink_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_interpolate_bicubic_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_local_response_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_max_unpool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_max_unpool1d_grad_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_multi_head_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_nll_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_normalize_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_pairwise_distance_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_poisson_nll_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_prelu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_rms_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_rrelu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_smooth_l1_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_softmin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nn_functional_softshrink_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_nonzero_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_normal_in_place_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_ravel_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_remainder_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_reshape_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_round_decimals_3_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_round_decimals_neg_3_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_rsqrt_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_scatter_reduce_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_select_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_signbit_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_sinc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_sinh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_bessel_y1_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_chebyshev_polynomial_t_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_laguerre_polynomial_l_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_log_ndtr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_modified_bessel_k1_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_shifted_chebyshev_polynomial_v_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_special_xlog1py_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_sub_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_tile_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_torch_ops_aten__efficient_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_triu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_true_divide_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_var_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_vdot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_operator_view_as_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay___rsub___cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay__segment_reduce_lengths_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_amin_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_as_strided_copy_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_asinh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_atan_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_atleast_3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_bernoulli_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_broadcast_shapes_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_bucketize_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_byte_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_cauchy_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_char_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_clamp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_diagonal_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_div_floor_rounding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_div_trunc_rounding_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_erfc_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_erfinv_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_fft_ifft_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_fft_rfftn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_fill_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_flatten_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_float_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_floor_divide_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_fmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_fmod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_gradient_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_hsplit_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_hstack_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_index_reduce_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_isinf_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_isnan_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_le_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_lerp_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_cholesky_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_det_singular_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_eigh_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_householder_product_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_matrix_rank_hermitian_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_multi_dot_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_qr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linalg_vander_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_linspace_tensor_overload_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_log10_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_log_softmax_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_logical_not_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_lu_solve_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_masked_fill_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_masked_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_masked_median_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_masked_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_masked_var_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_min_reduction_with_dim_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_mul_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_multinomial_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nansum_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_native_dropout_backward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_new_empty_strided_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_adaptive_max_pool1d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_batch_norm_without_cudnn_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_celu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_dropout2d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_dropout_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_embedding_bag_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_fractional_max_pool3d_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_group_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_interpolate_area_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_interpolate_bilinear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_interpolate_trilinear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_l1_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_layer_norm_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_linear_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_mse_loss_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_pixel_shuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_pixel_unshuffle_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_selu_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_nn_functional_softplus_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_pinverse_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_polygamma_polygamma_n_1_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_prod_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_put_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_randn_like_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_resize_as__cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_rot90_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_round_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_scatter_reduce_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_signal_windows_bartlett_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_signal_windows_general_hamming_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_signal_windows_hann_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_signbit_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_slice_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_sparse_mm_reduce_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_special_legendre_polynomial_p_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_special_log_ndtr_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_special_ndtri_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_special_scaled_modified_bessel_k0_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_special_shifted_chebyshev_polynomial_u_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_std_mean_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_sub_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_tensor_split_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_torch_ops_aten__efficient_attention_forward_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_transpose_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_uniform_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_unique_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_view_as_complex_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_view_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_vsplit_cuda_float32, test/test_ops.py::TestCompositeComplianceCUDA::test_view_replay_vstack_cuda_float32, test/test_ops.py::TestMathBitsCUDA::test_conj_view___rsub___cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs__conversions_cdouble_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_addcmul_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_allclose_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_as_strided_partial_views_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_conj_physical_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_empty_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_expm1_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_float_power_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_hsplit_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_index_select_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_linspace_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_log2_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_log_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_logspace_tensor_overload_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_mul_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_narrow_copy_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_nn_functional_log_softmax_with_dtype_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_nn_functional_triplet_margin_loss_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_reciprocal_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_reshape_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_rsqrt_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_rsub_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_sigmoid_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_special_softmax_with_dtype_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_sqrt_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_stack_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_to_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_triu_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_unbind_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_unflatten_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view__refs_where_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_addmm_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_bfloat16_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_block_diag_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_bool_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_char_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_combinations_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_corrcoef_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_cosh_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_cov_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_diag_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_empty_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_expm1_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_fft_irfft_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_fft_irfftn_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_gather_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_index_put_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_cholesky_ex_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_lstsq_grad_oriented_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_lu_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_lu_solve_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_matrix_power_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_linalg_vander_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_mH_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_masked_scatter_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_meshgrid_list_of_tensors_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_mm_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_movedim_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_mv_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_nn_functional_rms_norm_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_nn_functional_triplet_margin_loss_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_norm_fro_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_pinverse_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_repeat_interleave_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_reshape_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_resize__cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_scatter_add_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_sigmoid_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_split_with_sizes_copy_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_squeeze_multiple_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_std_mean_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_trace_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_unfold_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_conj_view_vsplit_cuda_complex64, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view___rmatmul___cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view___rmul___cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view___rsub___cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs__conversions_bfloat16_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs__conversions_float_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs__conversions_half_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_contiguous_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_count_nonzero_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_diag_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_dsplit_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_dstack_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_empty_strided_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_fft_irfft2_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_fill_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_isinf_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_istft_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_log1p_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_log_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_logical_or_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_meshgrid_variadic_tensors_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_new_zeros_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_nn_functional_log_softmax_with_dtype_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_nn_functional_pairwise_distance_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_nn_functional_softmax_with_dtype_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_normal__in_place_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_reshape_as_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_sigmoid_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_sqrt_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_sum_to_size_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_take_along_dim_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_tan_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view__refs_triu_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_add_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_addcdiv_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_all_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_atleast_2d_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_bmm_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_bool_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_byte_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_cos_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_count_nonzero_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_diag_embed_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_div_no_rounding_mode_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_empty_strided_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_expand_as_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_fft_ifft_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_fill_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_fliplr_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_isreal_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_jiterator_4inputs_with_extra_args_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_jiterator_binary_return_by_ref_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_ldexp_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_diagonal_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_ldl_factor_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_ldl_factor_ex_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_lu_factor_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_lu_factor_ex_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_matrix_rank_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_matrix_rank_hermitian_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_norm_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_slogdet_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_tensorsolve_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_linalg_vander_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_mH_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_masked_sum_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_mean_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_conv1d_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_conv2d_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_linear_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_pairwise_distance_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_rms_norm_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_nn_functional_unfold_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_permute_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_ravel_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_reshape_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_rsub_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_sgn_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_sinh_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_softmax_with_dtype_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_square_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_stft_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_sum_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_svd_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_take_along_dim_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_tan_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_trapezoid_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_tril_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_unflatten_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_unfold_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_conj_view_view_as_real_cuda_complex128, test/test_ops.py::TestMathBitsCUDA::test_neg_view_T_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__native_batch_norm_legit_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs__conversions_byte_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs__conversions_long_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_acos_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_add_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_alias_copy_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_arange_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_atleast_3d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_broadcast_to_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_column_stack_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_contiguous_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_cos_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_diagonal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_div_floor_rounding_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_empty_strided_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_erf_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_erfinv_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_fft_ihfftn_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_fliplr_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_flipud_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_ge_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_hypot_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_isreal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_lgamma_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_linalg_diagonal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_linalg_vector_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_linspace_tensor_overload_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_log10_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_log1p_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_log_normal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_nan_to_num_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_narrow_copy_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_native_layer_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_ne_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_new_full_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_new_ones_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_nn_functional_l1_loss_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_nn_functional_prelu_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_nn_functional_relu_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_normal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_ones_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_prod_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_rad2deg_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_real_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_reshape_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_rot90_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_sinc_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_special_i1e_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_special_log_ndtr_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_special_log_softmax_with_dtype_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_special_multigammaln_mvlgamma_p_1_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_special_ndtri_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_tril_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_unflatten_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_view_as_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__refs_xlogy_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view__unsafe_masked_index_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_abs_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_acos_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_addcdiv_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_addmm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_as_strided_scatter_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_atan_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_baddbmm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_contiguous_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_cos_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_cross_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_cumulative_trapezoid_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_deg2rad_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_dist_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_div_trunc_rounding_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_erfc_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_erfinv_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_exponential_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_flatten_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_frac_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_grid_sampler_2d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_hstack_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_index_reduce_mean_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_jiterator_4inputs_with_extra_args_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_jiterator_unary_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linalg_inv_ex_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linalg_lu_solve_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linalg_matrix_rank_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linalg_solve_triangular_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linalg_vector_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_linspace_tensor_overload_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_log10_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_log_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_logdet_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_long_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_mH_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_masked_argmax_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_masked_fill_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_masked_logsumexp_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_masked_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_masked_prod_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_matmul_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_max_reduction_no_dim_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_min_reduction_no_dim_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_mm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_mode_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_native_batch_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_ne_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_new_zeros_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_alpha_dropout_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_avg_pool1d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_avg_pool2d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_conv3d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_conv_transpose1d_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_dropout_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_feature_alpha_dropout_with_train_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_hardswish_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_interpolate_linear_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_local_response_norm_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_max_unpool1d_grad_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_mish_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_pad_reflect_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_pad_replicate_negative_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_pdist_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_silu_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_soft_margin_loss_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_softshrink_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_nn_functional_threshold_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_norm_fro_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_normal_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_ormqr_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_pca_lowrank_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_pinverse_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_polygamma_polygamma_n_0_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_polygamma_polygamma_n_2_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_polygamma_polygamma_n_3_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_resize_as__cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_short_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_signal_windows_hann_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_softmax_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_special_chebyshev_polynomial_t_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_special_chebyshev_polynomial_u_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_special_chebyshev_polynomial_v_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_special_i0e_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_special_i1e_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_sqrt_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_squeeze_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_stack_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_svd_lowrank_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_triu_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_unflatten_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_unfold_copy_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_unfold_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_unsafe_split_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_var_cuda_float64, test/test_ops.py::TestMathBitsCUDA::test_neg_view_zeros_cuda_float64, test/test_ops.py::TestFakeTensorCUDA::test_fake___rmod___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake__chunk_cat_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_add_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_amax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_amin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_argmin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_asin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_atanh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_atleast_2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast___rand___cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast__segment_reduce_offsets_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_addmm_decomposed_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_addmv_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_argmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_argsort_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_as_strided_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_asinh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_baddbmm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_bfloat16_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_bitwise_and_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_bitwise_left_shift_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_cholesky_inverse_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_conj_physical_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_cov_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_cumsum_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_cumulative_trapezoid_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_erf_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_fft_fft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_fft_hfftn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_fft_irfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_fft_rfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_fft_rfftn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_floor_divide_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_frexp_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_isclose_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_jiterator_binary_return_by_ref_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_jiterator_unary_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_lcm_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_lgamma_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_linalg_det_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_linalg_inv_ex_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_linalg_ldl_factor_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_linalg_ldl_solve_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_linalg_tensorinv_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_log1p_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_log_normal_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_log_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_masked_cumsum_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_masked_logsumexp_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_masked_var_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_mvlgamma_mvlgamma_p_3_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nanquantile_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_new_full_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_adaptive_avg_pool1d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_adaptive_avg_pool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_alpha_dropout_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_avg_pool1d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_batch_norm_without_cudnn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_conv_transpose2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_cosine_similarity_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_embedding_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_fractional_max_pool2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_grid_sample_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_pixel_unshuffle_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_poisson_nll_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_softplus_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_softsign_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_nn_functional_threshold_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_norm_nuc_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_pow_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_ravel_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_renorm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_round_decimals_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_rsqrt_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_scatter_reduce_amin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_scatter_reduce_sum_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_sign_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_signal_windows_exponential_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_signal_windows_general_hamming_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_bessel_j1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_modified_bessel_k1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_ndtr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_polygamma_special_polygamma_n_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_shifted_chebyshev_polynomial_u_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_special_spherical_bessel_j0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_std_mean_unbiased_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_sum_to_size_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_tile_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_topk_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_unique_consecutive_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_unsafe_split_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_autocast_unsqueeze_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_bitwise_not_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_bitwise_right_shift_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_fake_block_diag_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_bmm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_bool_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_broadcast_shapes_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_broadcast_tensors_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_cdouble_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_char_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_clamp_max_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_cov_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp___rsub___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp__batch_norm_with_update_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp__softmax_backward_data_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_add_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_addmm_decomposed_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_addmv_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_as_strided_scatter_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_atleast_3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_cat_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_clamp_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_clone_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_cosh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_diff_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_exp2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_exp_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_expand_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_expm1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_fftshift_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_hfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_ifft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_ifftshift_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_irfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fft_rfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fill_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_flip_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_fliplr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_hsplit_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_index_reduce_amax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_index_reduce_amin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_inv_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_pinv_hermitian_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_solve_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_linalg_vector_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_log10_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_log_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_masked_log_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_masked_select_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_max_reduction_no_dim_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_mvlgamma_mvlgamma_p_5_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_avg_pool2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_avg_pool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_bilinear_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_elu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_instance_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_interpolate_area_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_interpolate_nearest_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_linear_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_max_pool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_max_unpool2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_max_unpool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_pad_circular_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_pad_constant_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_selu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_soft_margin_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_softplus_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_nn_functional_triplet_margin_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_polygamma_polygamma_n_4_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_scatter_reduce_prod_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_scatter_reduce_sum_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_sgn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_sinc_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_sinh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_sort_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_special_ndtri_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_special_polygamma_special_polygamma_n_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_split_with_sizes_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_std_mean_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_stft_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_topk_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_tril_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_unflatten_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_amp_view_copy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp___rdiv___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp___rpow___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp__segment_reduce_offsets_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp__softmax_backward_data_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_addr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_as_strided_copy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_atan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_atleast_1d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_atleast_2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_block_diag_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_cdist_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_ceil_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_cholesky_solve_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_contiguous_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_cummax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_diagonal_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_diagonal_scatter_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_dist_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_div_floor_rounding_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_fft_hfft_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_flatten_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_floor_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_fmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_index_add_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_index_put_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_index_reduce_amin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_det_singular_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_inv_ex_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_lstsq_grad_oriented_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_slogdet_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_linalg_vector_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_mT_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_mean_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_median_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_prod_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_scatter_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_softmin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_masked_std_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_matmul_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_max_binary_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_min_reduction_no_dim_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_min_reduction_with_dim_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_mul_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_neg_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_alpha_dropout_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_conv2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_conv3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_cross_entropy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_embedding_bag_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_interpolate_nearest_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_interpolate_trilinear_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_margin_ranking_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_max_unpool1d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_max_unpool1d_grad_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_max_unpool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_mish_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_multi_head_attention_forward_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_pad_constant_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_rms_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_selu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_triplet_margin_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_nn_functional_upsample_bilinear_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_norm_nuc_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_ormqr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_real_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_repeat_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_rot90_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_sgn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_sign_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_sparse_mm_reduce_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_special_erfcx_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_special_i0e_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_special_ndtr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_split_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_tan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_unfold_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_unsafe_split_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_crossref_backward_no_amp_var_mean_unbiased_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_cummin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_expm1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_eye_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_fft_fft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_fft_fftn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_fft_irfft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_fft_rfftn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_float_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_gather_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_ge_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_geqrf_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_heaviside_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_imag_cuda_complex64, test/test_ops.py::TestFakeTensorCUDA::test_fake_index_fill_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_isnan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_isneginf_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_cholesky_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_det_singular_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_diagonal_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_eigvals_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_lstsq_grad_oriented_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_matrix_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linalg_vecdot_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_linspace_tensor_overload_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_log_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_logical_not_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_logit_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_lu_unpack_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_masked_cumprod_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_masked_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_max_reduction_with_dim_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nanquantile_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_new_ones_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_adaptive_avg_pool3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_avg_pool2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_cross_entropy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_dropout3d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_elu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_gaussian_nll_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_instance_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_interpolate_bicubic_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_margin_ranking_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_max_unpool1d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_multilabel_soft_margin_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_pairwise_distance_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_relu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_silu_complex_cuda_complex64, test/test_ops.py::TestFakeTensorCUDA::test_fake_nn_functional_softplus_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_norm_fro_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_norm_inf_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_normal_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_pinverse_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_resolve_conj_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_resolve_neg_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_round_decimals_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_scatter_reduce_amin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_scatter_reduce_prod_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_sinh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_special_polygamma_special_polygamma_n_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_special_xlog1py_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_special_zeta_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_sum_to_size_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_fake_unsqueeze_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops___rmod___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops___rpow___cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops__upsample_bilinear2d_aa_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_abs_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_acosh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_addmm_decomposed_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_alias_copy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_atan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_atleast_2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_bfloat16_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_bincount_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_bucketize_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_constant_pad_nd_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_diag_embed_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_expm1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_eye_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_fft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_fftn_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_ifft2_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_ifft_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_irfft_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_fft_rfft_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_floor_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_ge_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_geometric_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_geqrf_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_index_add_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_isfinite_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_isnan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_istft_cuda_complex64, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_item_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_kthvalue_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_le_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_cond_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_cross_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_eig_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_matrix_rank_hermitian_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_pinv_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_linalg_svd_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_log_normal_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_log_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_log_softmax_with_dtype_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_logical_not_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_logical_xor_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_logsumexp_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_lu_unpack_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_masked_log_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_masked_median_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_masked_softmax_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_mean_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_movedim_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nanquantile_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_native_layer_norm_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nextafter_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_adaptive_avg_pool2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_binary_cross_entropy_with_logits_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_conv_transpose2d_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_ctc_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_feature_alpha_dropout_with_train_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_interpolate_nearest-exact_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_interpolate_nearest_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_kl_div_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_max_unpool1d_grad_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_mse_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_multi_margin_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_pairwise_distance_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_poisson_nll_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_relu6_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_rrelu_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_silu_complex_cuda_complex64, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_smooth_l1_loss_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_softmin_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_softsign_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_tanhshrink_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_nn_functional_threshold_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_norm_nuc_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_ormqr_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_polygamma_polygamma_n_1_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_rad2deg_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_rand_like_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_randn_like_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_ravel_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_reshape_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_resolve_conj_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_rot90_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_round_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_round_decimals_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_scatter_add_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_signal_windows_hamming_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_signal_windows_kaiser_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_special_bessel_j0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_special_hermite_polynomial_h_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_special_laguerre_polynomial_l_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_special_polygamma_special_polygamma_n_0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_special_scaled_modified_bessel_k0_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_squeeze_multiple_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_std_unbiased_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_sum_to_size_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_t_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_tan_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_trace_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_triu_indices_cuda_int64, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_unfold_copy_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_unsqueeze_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_var_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_pointwise_ops_vsplit_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_linspace_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_logspace_cuda_float16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_logspace_cuda_int8, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_logspace_tensor_overload_cuda_complex128, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_logspace_tensor_overload_cuda_float16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_logspace_tensor_overload_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_ones_cuda_bool, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_ones_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_ones_cuda_int8, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_zeros_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout__refs_zeros_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_full_cuda_bool, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_full_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_full_cuda_uint8, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_linspace_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_cuda_complex128, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_cuda_uint8, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_tensor_overload_cuda_bfloat16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_tensor_overload_cuda_float16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_logspace_tensor_overload_cuda_int8, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_ones_cuda_bool, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_ones_cuda_float64, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_ones_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_bfloat16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_complex32, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_complex64, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_float32, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_int16, test/test_ops.py::TestFakeTensorCUDA::test_strided_layout_zeros_cuda_int32, test/test_ops.py::TestTagsCUDA::test_tags___getitem___cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags___rdiv___cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags___ror___cuda_int64, test/test_ops.py::TestTagsCUDA::test_tags__refs_T_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs__conversions_cfloat_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs__conversions_double_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs__conversions_float_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_all_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_amin_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_as_strided_copy_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_bitwise_or_cuda_int64, test/test_ops.py::TestTagsCUDA::test_tags__refs_block_diag_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_broadcast_shapes_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_diag_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_diagonal_copy_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_dstack_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_empty_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_fft_hfftn_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_fft_ifft2_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_fmax_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_geometric_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_hstack_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_hypot_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_index_select_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_isneginf_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_lcm_cuda_int64, test/test_ops.py::TestTagsCUDA::test_tags__refs_linalg_norm_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_linspace_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_log10_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_log_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_logaddexp2_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_logspace_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_new_ones_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_nn_functional_leaky_relu_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_nn_functional_poisson_nll_loss_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_nn_functional_softmin_with_dtype_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_pow_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_remainder_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_special_log_softmax_with_dtype_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_special_multigammaln_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_special_multigammaln_mvlgamma_p_3_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_special_multigammaln_mvlgamma_p_5_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_special_ndtri_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_stack_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_take_along_dim_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_tanh_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_tril_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_unfold_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_unsqueeze_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__refs_where_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags__upsample_bilinear2d_aa_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_acos_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_all_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_argsort_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_argwhere_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_as_strided_copy_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_atleast_1d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_bitwise_xor_cuda_int64, test/test_ops.py::TestTagsCUDA::test_tags_clamp_max_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_column_stack_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_contiguous_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_diagflat_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_dot_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_dsplit_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_empty_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_erfinv_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_expand_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_exponential_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_fill_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_flip_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_frexp_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_index_reduce_amin_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_item_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_lgamma_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_cond_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_det_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_diagonal_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_eigh_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_lstsq_grad_oriented_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_qr_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_solve_triangular_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_tensorsolve_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linalg_vecdot_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_linspace_tensor_overload_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_masked_cumsum_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_masked_std_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_masked_var_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_max_pool2d_with_indices_backward_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_meshgrid_variadic_tensors_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_movedim_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_multinomial_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_mvlgamma_mvlgamma_p_1_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nan_to_num_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_new_ones_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_avg_pool1d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_avg_pool3d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_batch_norm_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_binary_cross_entropy_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_conv1d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_conv3d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_elu_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_feature_alpha_dropout_without_train_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_fractional_max_pool2d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_grid_sample_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_group_norm_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_hardshrink_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_hardtanh_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_interpolate_bicubic_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_interpolate_bilinear_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_interpolate_linear_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_max_pool3d_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_pairwise_distance_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_pdist_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_rms_norm_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_soft_margin_loss_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_softsign_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_nn_functional_upsample_bilinear_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_normal_in_place_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_normal_number_mean_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_pinverse_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_randn_like_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_resize_as__cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_rsub_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_signal_windows_cosine_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_signal_windows_gaussian_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_slice_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_sparse_mm_reduce_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_sparse_sampled_addmm_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_special_airy_ai_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_special_chebyshev_polynomial_w_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_special_laguerre_polynomial_l_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_special_legendre_polynomial_p_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_special_modified_bessel_i1_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_squeeze_multiple_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_std_mean_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_tan_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_to_sparse_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_torch_ops_aten__flash_attention_forward_cuda_float16, test/test_ops.py::TestTagsCUDA::test_tags_true_divide_cuda_float32, test/test_ops.py::TestTagsCUDA::test_tags_view_copy_cuda_float32 2024-06-26T07:58:31.8484335Z 2024-06-26T07:59:29.7896309Z 2024-06-26T07:59:29.7897895Z test_sparse_csr 2/2 was successful, full logs can be found in artifacts with path test/test-reports/test_sparse_csr_2.2_71d3a4f437f0d4db_.log 2024-06-26T07:59:29.9048989Z Running 2414 items in this shard: test/test_sparse_csr.py::TestSparseCSRCUDA::test_add_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_add_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_dense_result_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_0_m_25_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_10_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_0_n_1_m_25_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_0_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_10_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_0_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_1_n_1_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_0_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_0_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_0_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_25_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_10_m_25_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_0_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_0_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_k_8_n_1_m_25_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_11x9_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_3x3_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_5x7_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_addmv_shape_5x7_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_asinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_atanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_expm1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_isneginf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sgn_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_sinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_autograd_sparse_csr_unary_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_baddbmm_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_baddbmm_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_baddbmm_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int32_noncontiguous_True_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_False_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_2_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_False_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_False_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int32_noncontiguous_True_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_False_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmm_block_size_3_int64_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int32_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_2_int64_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int32_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_addmv_block_size_3_int64_noncontiguous_True_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int32_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_False_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_2_int64_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_False_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_False_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int32_noncontiguous_True_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int64_noncontiguous_False_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int64_noncontiguous_True_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_block_triangular_solve_block_size_3_int64_noncontiguous_True_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSC_SparseBSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSC_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseBSR_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSC_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSR_SparseBSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSR_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_compressed_layout_conversions_coverage_SparseCSR_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_csr_conversion_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_coo_to_csr_convert_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_coo_conversion_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_matvec_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_matvec_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_matvec_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_matvec_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_storage_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_stride_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_to_block_csr_blocksize_2_cuda_float64_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_csr_to_block_csr_blocksize_4_cuda_float64_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSC_NonBatched_Hybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSC_NonBatched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSR_Batched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSR_NonBatched_Hybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseBSR_NonBatched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_dense_to_from_sparse_compressed_SparseCSC_NonBatched_NonHybrid_cuda, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_direct_coo_csr_conversion_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_exercise_detach_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_matmul_device_mismatch_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mm_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_mul_scalar_enable_hybrid_False_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_as_sparse_compressed_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_resize_errors_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_autograd_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_autograd_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_errors_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_errors_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSR_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSC_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseCSR_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_add_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csc_to_dense_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_from_dense_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_to_dense_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_abs_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_deg2rad_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_floor_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isneginf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_nn_functional_relu_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_rad2deg_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_round_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_round_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_round_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_round_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_round_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sgn_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sign_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sinh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_tanh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_inplace_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_abs_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_deg2rad_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isneginf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_isposinf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_log1p_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_nn_functional_relu_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_nn_functional_relu_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_positive_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_rad2deg_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_round_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sinh_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_csr_unary_out_trunc_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_mm_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_mm_reduce_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_mm_reduce_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_to_sparse_compressed_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_to_sparse_compressed_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_triangular_solve_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_transpose_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_abs_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_angle_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_asinh_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atan_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_deg2rad_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_expm1_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_frac_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isinf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isposinf_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_isposinf_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_nn_functional_relu_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_positive_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_round_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_round_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_round_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sign_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sign_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sign_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sin_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_int16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_sqrt_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_complex128, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_float32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_float64, test/test_sparse_csr.py::TestSparseCSRCUDA::test_zero_to_zero_correspondence_unary_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_clone_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_abs_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_conj_physical_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_deg2rad_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_erfinv_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_expm1_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isnan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isneginf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amax_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_amin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_mean_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_prod_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_masked_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_mul_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_nn_functional_relu_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_randn_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_round_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sgn_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sign_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_to_sparse_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSC_zeros_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_abs_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_angle_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_atanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_conj_physical_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_frac_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isnan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_isposinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_log1p_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amax_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_amin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_mean_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_mean_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_prod_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_masked_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_mul_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_nn_functional_relu_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_rad2deg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_randn_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_randn_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_round_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sign_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sqrt_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_tanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_to_sparse_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_trunc_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_trunc_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseBSR_zeros_like_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_abs_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asin_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atan_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_ceil_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_conj_physical_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_frac_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isnan_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isnan_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_isposinf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_log1p_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amax_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_amin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_mean_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_prod_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_masked_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_mul_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_nn_functional_relu_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_randn_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_round_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_round_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_round_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sgn_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sign_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sinh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sqrt_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_tanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_to_sparse_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSC_zeros_like_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_abs_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_angle_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_asinh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_atanh_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_ceil_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_conj_physical_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_deg2rad_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_erfinv_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_expm1_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_floor_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_frac_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_frac_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isnan_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isneginf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_isposinf_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_log1p_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amax_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_amin_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_mean_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_prod_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_masked_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_mul_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_neg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_nn_functional_relu_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_nn_functional_relu_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_nn_functional_relu_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_nn_functional_relu_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_positive_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_rad2deg_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_randn_like_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_round_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sgn_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sign_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_signbit_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_complex32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sin_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sinh_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sqrt_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_sum_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tan_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_tanh_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_to_sparse_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_trunc_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_consistency_SparseCSR_zeros_like_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_copy_errors_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_dim_SparseBSC_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_dim_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_dim_SparseCSC_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_dim_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_errors_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSC_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseBSR_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSC_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_empty_like_SparseCSR_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSC_target_sparse_compressed_tensor_no_size_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSR_target_sparse_compressed_tensor_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_invalid_input_SparseBSR_target_sparse_compressed_tensor_no_size_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_layout_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_layout_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_pickle_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_pickle_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_pickle_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_print_SparseBSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_print_SparseCSR_cuda, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseBSR_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int32_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSC_int64_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int32_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_select_copy_SparseCSR_int64_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_list_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_____from_tensor_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_list_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor___factory_from_tensor_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_list_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference___from_tensor_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_list_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_constructor_shape_and_device_inference_factory_from_tensor_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSC_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_sparse_compressed_tensor_with_dims_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_complex128, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_to_dtype_SparseCSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSC_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_bool, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseBSR_cuda_uint8, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_float64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSC_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_complex64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int16, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int32, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int64, test/test_sparse_csr.py::TestSparseCompressedCUDA::test_validate_SparseCSR_cuda_int8, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_16_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_32_int64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_64_int32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_64_int32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_dense_bmm_block_size_64_int64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_16_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_2_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_2_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_2x3_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_scatter_mm_blocksize_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_softmax_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_softmax_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_bsr_softmax_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_addmm_blocksize_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_addmm_blocksize_16_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_addmm_blocksize_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_linear_blocksize_16x32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_16_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_16_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_16x32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_kernel_op_bsr_dense_mm_blocksize_32_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_64_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_sampled_addmm_block_size_64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_16_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_32_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_32_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_64_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scaled_dot_product_attention_block_size_64_cuda_float32, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_scatter_mm_cuda_float16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_tune_op_bsr_dense_addmm_cuda_bfloat16, test/test_sparse_csr.py::TestSparseCompressedTritonKernelsCUDA::test_triton_tune_op_bsr_dense_addmm_cuda_float16 2024-06-26T07:59:30.0161814Z 2024-06-26T08:04:29.5241449Z 2024-06-26T08:04:29.5244278Z test_transformers 1/3 was successful, full logs can be found in artifacts with path test/test-reports/test_transformers_1.3_6bed6d32a3c435d1_.log 2024-06-26T08:04:31.7591541Z Running 24962 items in this shard: test/test_transformers.py::TestTransformersCUDA::test_bias_is_none_cuda, test/test_transformers.py::TestTransformersCUDA::test_encoder_is_causal_cuda, test/test_transformers.py::TestTransformersCUDA::test_encoder_padding_and_src_mask_bool_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_1_bias_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_1_bias_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_mha_native_args_nb_heads_8_bias_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim2_key_padding_mask_dim1_float32_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim_3_key_padding_mask_dim1_float32_cuda, test/test_transformers.py::TestTransformersCUDA::test_multiheadattention_fastpath_attn_mask_attn_mask_dim_3_key_padding_mask_dim_2_bool_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_3D_input_dim_3D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_3D_input_dim_3D_attn_mask_dropout_p_0_2_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_2D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_2D_attn_mask_dropout_p_0_5_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_4D_attn_mask_dropout_p_0_0_cuda, test/test_transformers.py::TestTransformersCUDA::test_scaled_dot_product_attention_4D_input_dim_4D_causal_attn_mask_dropout_p_0_2_cuda, test/test_transformers.py::TestTransformersCUDA::test_self_attn_TxT_attn_mask_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_False_training_False_enable_nested_tensor_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_False_training_True_enable_nested_tensor_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_batch_first_True_training_False_enable_nested_tensor_True_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_fastpath_use_torchscript_False_enable_nested_tensor_True_use_autocast_False_d_model_256_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoder_square_input_with_no_grad_True_training_False_enable_nested_tensor_False_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoderlayer_no_fastpath_with_hooks_nhead_4_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoderlayer_src_mask_nhead_1_cuda, test/test_transformers.py::TestTransformersCUDA::test_transformerencoderlayer_src_mask_nhead_4_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_flash_atteention_large_bf16_nan_values_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_flash_autocast_fp32_float16_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_flash_backward_failure_sm86plus_head_dim_204_dropout_p_0_0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_head_dim_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_fused_inputs_invalid_dtype_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_1_dimensional_inputs_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel0_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_datatypes_kernel2_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_inputs_different_devices_kernel1_cuda, test/test_transformers.py::TestSDPAFailureModesCUDA::test_invalid_sequence_lengths_kernel1_cuda, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_dense_dropout_0_7_bfloat16_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_nested_dropout_0_0_bfloat16_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_fused_sdp_choice_cpu_type_nested_dropout_0_7_float32_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float16_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_1_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float32_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_0_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_2_bool_mask_1_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_mask_vs_math_cpu_fused_kernel0_float64_batch_size_2_q_seq_len_267_kv_seq_len_514_n_head_3_head_dim_8_mask_dim_4_bool_mask_0_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_True_train_False_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_bfloat16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_True_train_True_cuda_bfloat16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_1_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_1_head_dim_16_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_1_head_dim_8_causal_True_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_1030_n_head_3_head_dim_8_causal_True_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_True_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_True_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_True_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_True_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float16_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_True_train_False_cuda_float16, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_1030_n_head_1_head_dim_16_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_1030_n_head_3_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_1_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_False_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float32_batch_size_2_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_float32, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_1_head_dim_8_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_1030_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_False_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_False_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_12_seq_len_267_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_1_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_16_causal_True_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_1030_n_head_3_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_False_train_False_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_16_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_scaled_dot_product_fused_attention_vs_math_cpu_fused_kernel0_float64_batch_size_2_seq_len_267_n_head_1_head_dim_8_causal_True_train_True_cuda_float64, test/test_transformers.py::TestSDPACUDA::test_sdp_math_gradcheck_contiguous_inputs_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_21_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_16_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_203_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_587_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_160_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_21_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_160_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_192_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_203_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_21_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_256_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_48_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_48_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_256_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_32_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_256_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_64_dropout_p_0_1_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_256_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_0_float16_scale_l1_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_64_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_0_float16_scale_l1_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_False_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_flash_attention_vs_math_ref_grads_nestedtensor_batch_size_8_max_seq_len_q_32_max_seq_len_kv_32_head_dim_8_dropout_p_0_1_float16_scale0_is_causal_True_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_fused_kernel0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_attention_vs_math_ref_grads_cudagraph_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_fused_kernel1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel0_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_False_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_False_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_False_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_False_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_False_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_False_expand_v_num_heads_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_kernel1_expand_q_batch_True_expand_k_batch_True_expand_v_batch_True_expand_q_num_heads_True_expand_k_num_heads_True_expand_v_num_heads_False_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_nested_broadcasting_query_dense_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_seq_len_1_inputs_fused_kernel0_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_fused_kernels_seq_len_1_inputs_fused_kernel1_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_eff_attention_pad_mask_float16_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_312_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_408_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_65_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_attn_mask_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_1_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_1024_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_128_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_2048_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_256_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_4_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_512_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_64_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_1024_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_128_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_2048_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_256_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_4_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_512_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_128_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_16_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_32_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_72_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_64_head_dim_96_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_128_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_False_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_16_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_32_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_False_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_64_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_0_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_False_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_72_is_causal_True_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_False_dropout_p_0_22_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float16_scale0_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_8_is_causal_True_dropout_p_0_22_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_False_dropout_p_0_0_float32_scale_l1_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_bfloat16_scale_l1_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_0_float32_scale0_cuda_float32, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_bfloat16_scale0_cuda_bfloat16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attention_vs_math_ref_grads_batch_size_8_seq_len_q_8_seq_len_k_8_head_dim_96_is_causal_True_dropout_p_0_22_float16_scale_l1_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attetntion_mask_variants_mask_dim_3_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_mem_efficient_attetntion_mask_variants_mask_dim_4_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_scaled_dot_product_attention_fused_kernels_packed_accuracy_type_nested_fused_kernel1_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_scaled_dot_product_attention_fused_kernels_packed_type_nested_is_contiguous_True_cuda, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_sdp_flash_attention_grad_against_math_contiguous_inputs_True_is_causal_True_float16_cuda_float16, test/test_transformers.py::TestSDPACudaOnlyCUDA::test_sdp_mem_efficient_grad_against_math_contiguous_inputs_True_is_causal_False_cuda, test/test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda, test/test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape1_cuda, test/test_transformers.py::TestAttnBiasCUDA::test_causal_variants_causal_variant_CausalVariant_UPPER_LEFT_shape0_cuda 2024-06-26T08:04:33.9946261Z 2024-06-26T08:04:33.9946269Z 2024-06-26T08:04:33.9946406Z real 56m34.225s 2024-06-26T08:04:33.9946728Z user 118m34.658s 2024-06-26T08:04:33.9947026Z sys 10m20.525s 2024-06-26T08:04:33.9947320Z + assert_git_not_dirty 2024-06-26T08:04:33.9947935Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *rocm* ]] 2024-06-26T08:04:33.9948596Z + [[ linux-focal-cuda12.1-py3.10-gcc9-sm86 != *xla* ]] 2024-06-26T08:04:33.9949115Z ++ git status --porcelain 2024-06-26T08:04:33.9949496Z ++ grep -v '?? third_party' 2024-06-26T08:04:33.9949838Z ++ true 2024-06-26T08:04:33.9950087Z + git_status= 2024-06-26T08:04:33.9950385Z + [[ -n '' ]] 2024-06-26T08:04:33.9950661Z + cleanup_workspace 2024-06-26T08:04:33.9951429Z + echo 'sudo may print the following warning message that can be ignored. The chown command will still run.' 2024-06-26T08:04:33.9952545Z sudo may print the following warning message that can be ignored. The chown command will still run. 2024-06-26T08:04:33.9953504Z + echo ' sudo: setrlimit(RLIMIT_STACK): Operation not permitted' 2024-06-26T08:04:33.9954161Z sudo: setrlimit(RLIMIT_STACK): Operation not permitted 2024-06-26T08:04:33.9954957Z + echo 'For more details refer to https://github.com/sudo-project/sudo/issues/42' 2024-06-26T08:04:33.9955838Z For more details refer to https://github.com/sudo-project/sudo/issues/42 2024-06-26T08:04:33.9956523Z + sudo chown -R 1000 /var/lib/jenkins/workspace 2024-06-26T08:04:33.9995379Z ##[group]Run cat test/**/*_toprint.log || true 2024-06-26T08:04:33.9995909Z cat test/**/*_toprint.log || true 2024-06-26T08:04:34.0003721Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.0004231Z env: 2024-06-26T08:04:34.0004512Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.0005130Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.0006063Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.0006729Z ##[endgroup] 2024-06-26T08:04:34.0073688Z cat: test/**/*_toprint.log: No such file or directory 2024-06-26T08:04:34.0106375Z ##[group]Run kill "$MONITOR_SCRIPT_PID" 2024-06-26T08:04:34.0106853Z kill "$MONITOR_SCRIPT_PID" 2024-06-26T08:04:34.0114359Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.0114857Z env: 2024-06-26T08:04:34.0115134Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.0115576Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.0116310Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.0116967Z MONITOR_SCRIPT_PID: 17931 2024-06-26T08:04:34.0117320Z ##[endgroup] 2024-06-26T08:04:34.0323074Z Prepare all required actions 2024-06-26T08:04:34.0323645Z Getting action download info 2024-06-26T08:04:34.1842743Z Download action repository 'actions/upload-artifact@v3' (SHA:a8a3f3ad30e3422c9c7b888a15615d19a852ae32) 2024-06-26T08:04:34.3450995Z ##[group]Run ./.github/actions/upload-test-artifacts 2024-06-26T08:04:34.3451522Z with: 2024-06-26T08:04:34.3452035Z file-suffix: test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777 2024-06-26T08:04:34.3452714Z s3-bucket: gha-artifacts 2024-06-26T08:04:34.3453104Z env: 2024-06-26T08:04:34.3453444Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.3453958Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.3454754Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.3455443Z ##[endgroup] 2024-06-26T08:04:34.3487824Z ##[group]Run # Remove any previous test jsons if they exist 2024-06-26T08:04:34.3488565Z # Remove any previous test jsons if they exist 2024-06-26T08:04:34.3489135Z rm -f test-jsons-*.zip 2024-06-26T08:04:34.3489697Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test -i '*.json' 2024-06-26T08:04:34.3497804Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.3498323Z env: 2024-06-26T08:04:34.3498644Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.3499160Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.3500113Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.3501080Z FILE_SUFFIX: test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777 2024-06-26T08:04:34.3501765Z ##[endgroup] 2024-06-26T08:04:34.3678905Z adding: test/allowlist_for_publicAPI.json (deflated 79%) 2024-06-26T08:04:34.3708788Z adding: test/benchmark_utils/callgrind_artifacts.json (deflated 92%) 2024-06-26T08:04:34.3709524Z adding: test/minioptest_failures_dict.json (deflated 70%) 2024-06-26T08:04:34.3715314Z adding: test/profiler/profiler_utils_mock_events.json (deflated 87%) 2024-06-26T08:04:34.3718550Z adding: test/test-reports/td_exclusions-c3ba702451057833c245.json (deflated 82%) 2024-06-26T08:04:34.3719417Z adding: test/.pytorch-slow-tests.json (deflated 77%) 2024-06-26T08:04:34.3727489Z adding: test/.pytorch-disabled-tests.json (deflated 84%) 2024-06-26T08:04:34.3757893Z ##[group]Run # Remove any previous test reports if they exist 2024-06-26T08:04:34.3758576Z # Remove any previous test reports if they exist 2024-06-26T08:04:34.3759153Z rm -f test-reports-*.zip 2024-06-26T08:04:34.3759800Z zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml' -i '*.csv' 2024-06-26T08:04:34.3767789Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.3768357Z env: 2024-06-26T08:04:34.3768675Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.3769144Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.3770018Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.3770914Z FILE_SUFFIX: test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777 2024-06-26T08:04:34.3771536Z ##[endgroup] 2024-06-26T08:04:34.3934517Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-d97d703909737afe.xml (deflated 28%) 2024-06-26T08:04:34.3943642Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_dynamic_shapes/inductor.test_torchinductor_dynamic_shapes-68c244cdf5fc79dd.xml (deflated 92%) 2024-06-26T08:04:34.3945757Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_dynamic_shapes/inductor.test_torchinductor_codegen_dynamic_shapes-012d2c3c11bed5fe.xml (deflated 42%) 2024-06-26T08:04:34.3959247Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_dynamic_shapes/inductor.test_torchinductor_codegen_dynamic_shapes-72b66e11479e414b.xml (deflated 92%) 2024-06-26T08:04:34.3961246Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-6e3f5e1fff6a9d30.xml (deflated 58%) 2024-06-26T08:04:34.3962852Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-67823ccfb538bdd4.xml (deflated 28%) 2024-06-26T08:04:34.3964407Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-31b878f5516904a8.xml (deflated 42%) 2024-06-26T08:04:34.3971805Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-e97f3b99ad6f181a.xml (deflated 92%) 2024-06-26T08:04:34.3982350Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-10da6d58650dd63e.xml (deflated 91%) 2024-06-26T08:04:34.3995806Z adding: test/test-reports/python-pytest/inductor.test_torchinductor/inductor.test_torchinductor-204551e9606561a1.xml (deflated 91%) 2024-06-26T08:04:34.3997444Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-c57cce81aafcabcb.xml (deflated 28%) 2024-06-26T08:04:34.3998967Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-ad87cfaf3afd02c5.xml (deflated 28%) 2024-06-26T08:04:34.4000509Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-ac79dba438bd27df.xml (deflated 91%) 2024-06-26T08:04:34.4002150Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor/inductor.test_aot_inductor-1c9ce7545319b0e0.xml (deflated 91%) 2024-06-26T08:04:34.4003729Z adding: test/test-reports/python-pytest/inductor.test_cuda_cpp_wrapper/inductor.test_cuda_cpp_wrapper-d24446aeb47f5901.xml (deflated 28%) 2024-06-26T08:04:34.4005602Z adding: test/test-reports/python-pytest/inductor.test_cuda_cpp_wrapper/inductor.test_cuda_cpp_wrapper-460120d9ea2352bf.xml (deflated 28%) 2024-06-26T08:04:34.4007229Z adding: test/test-reports/python-pytest/inductor.test_cuda_cpp_wrapper/inductor.test_cuda_cpp_wrapper-77389a44f76275d8.xml (deflated 86%) 2024-06-26T08:04:34.4008802Z adding: test/test-reports/python-pytest/inductor.test_cuda_cpp_wrapper/inductor.test_cuda_cpp_wrapper-64381d7aaba237c7.xml (deflated 87%) 2024-06-26T08:04:34.4010470Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-906b5e507b639582.xml (deflated 27%) 2024-06-26T08:04:34.4012125Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-9cc98cc6263d08c8.xml (deflated 28%) 2024-06-26T08:04:34.4013803Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-cfbcfc264dfc7dbc.xml (deflated 28%) 2024-06-26T08:04:34.4015541Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-47520ac921df7d1f.xml (deflated 92%) 2024-06-26T08:04:34.4017196Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-887f5331157d176e.xml (deflated 93%) 2024-06-26T08:04:34.4019031Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_opinfo/inductor.test_torchinductor_opinfo-88c8e98abf1ba266.xml (deflated 92%) 2024-06-26T08:04:34.4020775Z adding: test/test-reports/python-pytest/inductor.test_inplacing_pass/inductor.test_inplacing_pass-619c031f3bd6e7f4.xml (deflated 28%) 2024-06-26T08:04:34.4022389Z adding: test/test-reports/python-pytest/inductor.test_inplacing_pass/inductor.test_inplacing_pass-8f587735d3f2aa85.xml (deflated 79%) 2024-06-26T08:04:34.4023943Z adding: test/test-reports/python-pytest/inductor.test_config/inductor.test_config-ae85891966749f4e.xml (deflated 28%) 2024-06-26T08:04:34.4025484Z adding: test/test-reports/python-pytest/inductor.test_config/inductor.test_config-7ed832ec4cfcdc69.xml (deflated 85%) 2024-06-26T08:04:34.4027090Z adding: test/test-reports/python-pytest/inductor.test_cpp_wrapper_hipify/inductor.test_cpp_wrapper_hipify-26429e936baf7c6e.xml (deflated 28%) 2024-06-26T08:04:34.4028764Z adding: test/test-reports/python-pytest/inductor.test_cpp_wrapper_hipify/inductor.test_cpp_wrapper_hipify-8c8bdf033a08de90.xml (deflated 59%) 2024-06-26T08:04:34.4030422Z adding: test/test-reports/python-pytest/inductor.test_extension_backend/inductor.test_extension_backend-be1eb92e172a699f.xml (deflated 28%) 2024-06-26T08:04:34.4032033Z adding: test/test-reports/python-pytest/inductor.test_extension_backend/inductor.test_extension_backend-e36c0a7561e10a2f.xml (deflated 52%) 2024-06-26T08:04:34.4033646Z adding: test/test-reports/python-pytest/inductor.test_indexing/inductor.test_indexing-bc4d1c6ed55edaf8.xml (deflated 28%) 2024-06-26T08:04:34.4035125Z adding: test/test-reports/python-pytest/inductor.test_indexing/inductor.test_indexing-46214c66632828aa.xml (deflated 79%) 2024-06-26T08:04:34.4036677Z adding: test/test-reports/python-pytest/inductor.test_group_batch_fusion/inductor.test_group_batch_fusion-2182016ffadd41dd.xml (deflated 28%) 2024-06-26T08:04:34.4038291Z adding: test/test-reports/python-pytest/inductor.test_group_batch_fusion/inductor.test_group_batch_fusion-84b965d494e188fb.xml (deflated 70%) 2024-06-26T08:04:34.4039911Z adding: test/test-reports/python-pytest/inductor.test_torchbind/inductor.test_torchbind-316a23295c944bbe.xml (deflated 28%) 2024-06-26T08:04:34.4042855Z adding: test/test-reports/python-pytest/inductor.test_torchbind/inductor.test_torchbind-d0c0262a0fc5cbe9.xml (deflated 83%) 2024-06-26T08:04:34.4044369Z adding: test/test-reports/python-pytest/inductor.test_utils/inductor.test_utils-fee88923d63e0da9.xml (deflated 28%) 2024-06-26T08:04:34.4045979Z adding: test/test-reports/python-pytest/inductor.test_utils/inductor.test_utils-89518936c0a9690b.xml (deflated 36%) 2024-06-26T08:04:34.4047397Z adding: test/test-reports/python-pytest/inductor.test_fx_fusion/inductor.test_fx_fusion-83b412ed32a4c9aa.xml (deflated 28%) 2024-06-26T08:04:34.4048865Z adding: test/test-reports/python-pytest/inductor.test_fx_fusion/inductor.test_fx_fusion-309c92dc6620200e.xml (deflated 63%) 2024-06-26T08:04:34.4050306Z adding: test/test-reports/python-pytest/inductor.test_metrics/inductor.test_metrics-c397e6c44d4ded0f.xml (deflated 28%) 2024-06-26T08:04:34.4051773Z adding: test/test-reports/python-pytest/inductor.test_metrics/inductor.test_metrics-53d25efa47c1e091.xml (deflated 66%) 2024-06-26T08:04:34.4053279Z adding: test/test-reports/python-pytest/inductor.test_triton_kernels/inductor.test_triton_kernels-df17dd1923940ae4.xml (deflated 28%) 2024-06-26T08:04:34.4054833Z adding: test/test-reports/python-pytest/inductor.test_triton_kernels/inductor.test_triton_kernels-0606bcf222551b59.xml (deflated 94%) 2024-06-26T08:04:34.4056327Z adding: test/test-reports/python-pytest/inductor.test_foreach/inductor.test_foreach-799079e161475243.xml (deflated 28%) 2024-06-26T08:04:34.4057713Z adding: test/test-reports/python-pytest/inductor.test_foreach/inductor.test_foreach-38022ea79957e71f.xml (deflated 96%) 2024-06-26T08:04:34.4059027Z adding: test/test-reports/python-pytest/test_ops/test_ops-f732fa1f9b36138e.xml (deflated 28%) 2024-06-26T08:04:34.4207080Z adding: test/test-reports/python-pytest/test_ops/test_ops-a4f99d7f0e6e735d.xml (deflated 97%) 2024-06-26T08:04:34.4208394Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-c5f844ba029f2cae.xml (deflated 28%) 2024-06-26T08:04:34.4241167Z adding: test/test-reports/python-pytest/test_sparse_csr/test_sparse_csr-f3650e6695648c28.xml (deflated 96%) 2024-06-26T08:04:34.4242457Z adding: test/test-reports/python-pytest/test_foreach/test_foreach-aca865f6a490f820.xml (deflated 28%) 2024-06-26T08:04:34.4284377Z adding: test/test-reports/python-pytest/test_foreach/test_foreach-d1220c75ad0c9137.xml (deflated 96%) 2024-06-26T08:04:34.4285985Z adding: test/test-reports/python-pytest/test_transformers/test_transformers-5b32ba5cdc408d0d.xml (deflated 28%) 2024-06-26T08:04:34.4749060Z adding: test/test-reports/python-pytest/test_transformers/test_transformers-72856e31aa060793.xml (deflated 98%) 2024-06-26T08:04:34.4807124Z ##[group]Run # Remove any previous usage logs if they exist 2024-06-26T08:04:34.4807752Z # Remove any previous usage logs if they exist 2024-06-26T08:04:34.4808256Z rm -f logs-*.zip 2024-06-26T08:04:34.4808921Z # this workflow is also run in bazel build test, but we dont generate usage reports for it 2024-06-26T08:04:34.4809723Z # so check to see if the file exists first 2024-06-26T08:04:34.4810310Z if [ -f 'usage_log.txt' ]; then 2024-06-26T08:04:34.4810833Z  zip "logs-${FILE_SUFFIX}.zip" 'usage_log.txt' 2024-06-26T08:04:34.4811312Z fi 2024-06-26T08:04:34.4811661Z if ls test/**/*.log 1> /dev/null 2>&1; then 2024-06-26T08:04:34.4812231Z  zip -r "logs-${FILE_SUFFIX}.zip" test -i '*.log' 2024-06-26T08:04:34.4812717Z fi 2024-06-26T08:04:34.4820239Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.4820734Z env: 2024-06-26T08:04:34.4821015Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.4821453Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.4822177Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.4823029Z FILE_SUFFIX: test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777 2024-06-26T08:04:34.4823616Z ##[endgroup] 2024-06-26T08:04:34.4883423Z adding: usage_log.txt (deflated 92%) 2024-06-26T08:04:34.5067365Z adding: test/test-reports/inductor.test_halide_1.1_cd76d1abf5dd8067_.log (stored 0%) 2024-06-26T08:04:34.5068831Z adding: test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.5_5d16a40bb30a1f6d_.log (deflated 53%) 2024-06-26T08:04:34.5070259Z adding: test/test-reports/inductor.test_foreach_1.1_e8ddf0a34eccf5e8_.log (deflated 50%) 2024-06-26T08:04:34.5071861Z adding: test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_2.4_04d45de13b6b3661_.log (deflated 58%) 2024-06-26T08:04:34.5073437Z adding: test/test-reports/inductor.test_config_1.1_2d3624f5a0e8c289_.log (deflated 72%) 2024-06-26T08:04:34.5074838Z adding: test/test-reports/inductor.test_torchinductor_2.5_02d523ab75dc5315_.log (deflated 51%) 2024-06-26T08:04:34.5075917Z adding: test/test-reports/inductor.test_indexing_1.1_0201ed5a8c65659e_.log (deflated 77%) 2024-06-26T08:04:34.5076938Z adding: test/test-reports/inductor.test_torchinductor_3.5_68c4a191272db99e_.log (deflated 51%) 2024-06-26T08:04:34.5077941Z adding: test/test-reports/inductor.test_smoke_1.1_e1fdfda78eda197c_.log (stored 0%) 2024-06-26T08:04:34.5078935Z adding: test/test-reports/inductor.test_torchinductor_5.5_10872025cba9e9d7_.log (deflated 52%) 2024-06-26T08:04:34.5080015Z adding: test/test-reports/inductor.test_torchbind_1.1_a75708f0844c4663_.log (deflated 72%) 2024-06-26T08:04:34.5081036Z adding: test/test-reports/inductor.test_aot_inductor_4.11_d7fa306fe1a4f548_.log (deflated 51%) 2024-06-26T08:04:34.5082005Z adding: test/test-reports/test_foreach_1.1_ed29ca00e5b5073e_.log (deflated 49%) 2024-06-26T08:04:34.5083230Z adding: test/test-reports/inductor.test_aot_inductor_10.11_2043ffc964267947_.log (deflated 51%) 2024-06-26T08:04:34.5084411Z adding: test/test-reports/test_transformers_1.3_7fb68648f3e484ac_.log (deflated 49%) 2024-06-26T08:04:34.5085738Z adding: test/test-reports/inductor.test_cuda_cpp_wrapper_2.4_5c2ded4e6b8c41f0_.log (deflated 51%) 2024-06-26T08:04:34.5086845Z adding: test/test-reports/inductor.test_torchinductor_3.5_f5d0a8dec275c195_.log (deflated 87%) 2024-06-26T08:04:34.5087939Z adding: test/test-reports/inductor.test_cuda_cpp_wrapper_4.4_f023de498d1dcd20_.log (deflated 51%) 2024-06-26T08:04:34.5092139Z adding: test/test-reports/inductor.test_torchinductor_5.5_c773631a35c5b5f5_.log (deflated 87%) 2024-06-26T08:04:34.5093582Z adding: test/test-reports/inductor.test_torchinductor_opinfo_4.16_f134ef8362d55d36_.log (deflated 52%) 2024-06-26T08:04:34.5094916Z adding: test/test-reports/test_ops_2.7_89a369eb891d9f57_.log (deflated 49%) 2024-06-26T08:04:34.5095947Z adding: test/test-reports/inductor.test_torchinductor_opinfo_13.16_9a70c8c8f24f3f61_.log (deflated 52%) 2024-06-26T08:04:34.5096987Z adding: test/test-reports/test_sparse_csr_2.2_9210a3ca3f41947b_.log (deflated 49%) 2024-06-26T08:04:34.5098317Z adding: test/test-reports/inductor.test_torchinductor_opinfo_15.16_1e51e04d7c702944_.log (deflated 52%) 2024-06-26T08:04:34.5099766Z adding: test/test-reports/inductor.test_aot_inductor_4.11_fd0ee16c73e05668_.log (deflated 90%) 2024-06-26T08:04:34.5100826Z adding: test/test-reports/inductor.test_inductor_utils_1.1_ad70e84a7e428bf8_.log (stored 0%) 2024-06-26T08:04:34.5101832Z adding: test/test-reports/inductor.test_inductor_utils_1.1_44e7b565483f6937_.log (stored 0%) 2024-06-26T08:04:34.5102898Z adding: test/test-reports/inductor.test_inplacing_pass_1.1_80b1e15494706a6a_.log (deflated 51%) 2024-06-26T08:04:34.5104182Z adding: test/test-reports/inductor.test_utils_1.1_60246bc6b41dc106_.log (deflated 50%) 2024-06-26T08:04:34.5105538Z adding: test/test-reports/inductor.test_config_1.1_049251ccc3c354eb_.log (deflated 50%) 2024-06-26T08:04:34.5106629Z adding: test/test-reports/inductor.test_aot_inductor_10.11_ff5c2860c2121ea8_.log (deflated 89%) 2024-06-26T08:04:34.5107729Z adding: test/test-reports/inductor.test_cpp_wrapper_hipify_1.1_9d82448615fcae73_.log (deflated 51%) 2024-06-26T08:04:34.5108948Z adding: test/test-reports/inductor.test_cuda_cpp_wrapper_4.4_4fa77891ab73d656_.log (deflated 82%) 2024-06-26T08:04:34.5110123Z adding: test/test-reports/inductor.test_extension_backend_1.1_57ce8c8d87880ebf_.log (deflated 51%) 2024-06-26T08:04:34.5185781Z adding: test/test-reports/test_foreach_1.1_3c3369c72938ff4f_.log (deflated 95%) 2024-06-26T08:04:34.5187020Z adding: test/test-reports/inductor.test_smoke_1.1_333d89a53a0a8ead_.log (stored 0%) 2024-06-26T08:04:34.5188246Z adding: test/test-reports/inductor.test_fx_fusion_1.1_f1abefe51a6cfa81_.log (deflated 62%) 2024-06-26T08:04:34.5189624Z adding: test/test-reports/inductor.test_indexing_1.1_c0b3d8d346f79d29_.log (deflated 51%) 2024-06-26T08:04:34.5190797Z adding: test/test-reports/inductor.test_cuda_cpp_wrapper_2.4_df38bdac7176431b_.log (deflated 85%) 2024-06-26T08:04:34.5191890Z adding: test/test-reports/inductor.test_group_batch_fusion_1.1_0e4afa83a0bf033f_.log (deflated 68%) 2024-06-26T08:04:34.5192933Z adding: test/test-reports/inductor.test_metrics_1.1_c110c5bd840602fa_.log (deflated 64%) 2024-06-26T08:04:34.5193936Z adding: test/test-reports/inductor.test_torchbind_1.1_ea6b0c226fb66272_.log (deflated 51%) 2024-06-26T08:04:34.5302761Z adding: test/test-reports/test_ops_2.7_6d7de55c4ddf1d61_.log (deflated 92%) 2024-06-26T08:04:34.5303809Z adding: test/test-reports/inductor.test_utils_1.1_f0f6b9795669181b_.log (deflated 50%) 2024-06-26T08:04:34.6030157Z adding: test/test-reports/test_transformers_1.3_6bed6d32a3c435d1_.log (deflated 98%) 2024-06-26T08:04:34.6031512Z adding: test/test-reports/inductor.test_fx_fusion_1.1_db50191f29b17c70_.log (deflated 51%) 2024-06-26T08:04:34.6032727Z adding: test/test-reports/inductor.test_halide_1.1_843967e48ebe8469_.log (stored 0%) 2024-06-26T08:04:34.6033711Z adding: test/test-reports/inductor.test_metrics_1.1_857a03fea532330d_.log (deflated 50%) 2024-06-26T08:04:34.6034844Z adding: test/test-reports/inductor.test_inplacing_pass_1.1_61eba4467e022075_.log (deflated 68%) 2024-06-26T08:04:34.6035906Z adding: test/test-reports/inductor.test_triton_kernels_1.1_3129f1f01b7932c4_.log (deflated 51%) 2024-06-26T08:04:34.6090640Z adding: test/test-reports/test_sparse_csr_2.2_71d3a4f437f0d4db_.log (deflated 95%) 2024-06-26T08:04:34.6097054Z adding: test/test-reports/inductor.test_torchinductor_opinfo_13.16_13ab8c4fc0a42b82_.log (deflated 91%) 2024-06-26T08:04:34.6105530Z adding: test/test-reports/inductor.test_torchinductor_dynamic_shapes_1.5_5f33a46227fce35c_.log (deflated 91%) 2024-06-26T08:04:34.6112484Z adding: test/test-reports/inductor.test_torchinductor_2.5_7765004cb3db638e_.log (deflated 87%) 2024-06-26T08:04:34.6123762Z adding: test/test-reports/inductor.test_torchinductor_codegen_dynamic_shapes_2.4_8ea40204817c88d0_.log (deflated 92%) 2024-06-26T08:04:34.6130906Z adding: test/test-reports/inductor.test_torchinductor_opinfo_15.16_6404d1a0ec63af14_.log (deflated 91%) 2024-06-26T08:04:34.6132400Z adding: test/test-reports/inductor.test_cpp_wrapper_hipify_1.1_e9ebae6351a551ef_.log (deflated 62%) 2024-06-26T08:04:34.6133514Z adding: test/test-reports/inductor.test_extension_backend_1.1_ba5fa1db86d1a95d_.log (deflated 60%) 2024-06-26T08:04:34.6139218Z adding: test/test-reports/inductor.test_torchinductor_opinfo_4.16_5e420b85eaf8f346_.log (deflated 91%) 2024-06-26T08:04:34.6140608Z adding: test/test-reports/inductor.test_group_batch_fusion_1.1_8cfd52ff0c78f475_.log (deflated 77%) 2024-06-26T08:04:34.6146291Z adding: test/test-reports/inductor.test_triton_kernels_1.1_b31a980d81377bd5_.log (deflated 92%) 2024-06-26T08:04:34.6150823Z adding: test/test-reports/inductor.test_foreach_1.1_9969b32df902293a_.log (deflated 91%) 2024-06-26T08:04:34.6180594Z ##[group]Run # Remove any previous debugging artifacts if they exist 2024-06-26T08:04:34.6181308Z # Remove any previous debugging artifacts if they exist 2024-06-26T08:04:34.6181873Z rm -f debug-*.zip 2024-06-26T08:04:34.6182257Z if [ -d 'test/debug' ]; then 2024-06-26T08:04:34.6182767Z  zip -r "debug-${FILE_SUFFIX}.zip" test/debug 2024-06-26T08:04:34.6183310Z fi 2024-06-26T08:04:34.6191202Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:34.6191694Z env: 2024-06-26T08:04:34.6191971Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.6192421Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.6193153Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.6193999Z FILE_SUFFIX: test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777 2024-06-26T08:04:34.6194581Z ##[endgroup] 2024-06-26T08:04:34.6270127Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-06-26T08:04:34.6270562Z with: 2024-06-26T08:04:34.6270848Z s3-bucket: gha-artifacts 2024-06-26T08:04:34.6271273Z s3-prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:34.6271740Z retention-days: 14 2024-06-26T08:04:34.6272078Z if-no-files-found: warn 2024-06-26T08:04:34.6272431Z path: test-jsons-*.zip 2024-06-26T08:04:34.6272766Z name: artifact 2024-06-26T08:04:34.6273066Z region: us-east-1 2024-06-26T08:04:34.6273369Z env: 2024-06-26T08:04:34.6273635Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:34.6274084Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:34.6274817Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:34.6275455Z ##[endgroup] 2024-06-26T08:04:34.9391385Z NOTE: s3-prefix specified, ignoring name parameter 2024-06-26T08:04:34.9391994Z With the provided path, there will be 1 file uploaded 2024-06-26T08:04:34.9392640Z Uploading to s3 prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:34.9418293Z Starting upload of test-jsons-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:35.0713225Z Finished upload of test-jsons-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:35.0942514Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-06-26T08:04:35.0942951Z with: 2024-06-26T08:04:35.0943235Z s3-bucket: gha-artifacts 2024-06-26T08:04:35.0943656Z s3-prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:35.0944114Z retention-days: 14 2024-06-26T08:04:35.0944449Z if-no-files-found: error 2024-06-26T08:04:35.0944807Z path: test-reports-*.zip 2024-06-26T08:04:35.0945288Z name: artifact 2024-06-26T08:04:35.0945589Z region: us-east-1 2024-06-26T08:04:35.0945883Z env: 2024-06-26T08:04:35.0946146Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:35.0946589Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:35.0947437Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:35.0948067Z ##[endgroup] 2024-06-26T08:04:35.4031147Z NOTE: s3-prefix specified, ignoring name parameter 2024-06-26T08:04:35.4032557Z With the provided path, there will be 1 file uploaded 2024-06-26T08:04:35.4033773Z Uploading to s3 prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:35.4057755Z Starting upload of test-reports-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:35.6180148Z Finished upload of test-reports-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:35.6325588Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-06-26T08:04:35.6326033Z with: 2024-06-26T08:04:35.6326319Z s3-bucket: gha-artifacts 2024-06-26T08:04:35.6326741Z s3-prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:35.6327204Z retention-days: 14 2024-06-26T08:04:35.6327544Z if-no-files-found: ignore 2024-06-26T08:04:35.6327900Z path: logs-*.zip 2024-06-26T08:04:35.6328213Z name: artifact 2024-06-26T08:04:35.6328518Z region: us-east-1 2024-06-26T08:04:35.6328813Z env: 2024-06-26T08:04:35.6329082Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:35.6329535Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:35.6330256Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:35.6330881Z ##[endgroup] 2024-06-26T08:04:35.9418711Z NOTE: s3-prefix specified, ignoring name parameter 2024-06-26T08:04:35.9419529Z With the provided path, there will be 1 file uploaded 2024-06-26T08:04:35.9420165Z Uploading to s3 prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:35.9445821Z Starting upload of logs-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:36.1653525Z Finished upload of logs-test-default-5-5-linux.g5.4xlarge.nvidia.gpu_26688880777.zip 2024-06-26T08:04:36.1798448Z ##[group]Run seemethere/upload-artifact-s3@v5 2024-06-26T08:04:36.1798893Z with: 2024-06-26T08:04:36.1799197Z s3-bucket: gha-artifacts 2024-06-26T08:04:36.1799622Z s3-prefix: pytorch/pytorch/9673645538/1/artifact 2024-06-26T08:04:36.1800086Z retention-days: 14 2024-06-26T08:04:36.1800431Z if-no-files-found: ignore 2024-06-26T08:04:36.1800791Z path: debug-*.zip 2024-06-26T08:04:36.1801094Z name: artifact 2024-06-26T08:04:36.1801395Z region: us-east-1 2024-06-26T08:04:36.1801689Z env: 2024-06-26T08:04:36.1801968Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:36.1802430Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:36.1803166Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:36.1803801Z ##[endgroup] 2024-06-26T08:04:36.4848283Z No files were found with the provided path: debug-*.zip. No artifacts will be uploaded. 2024-06-26T08:04:36.4993055Z ##[group]Run # shellcheck disable=SC2156 2024-06-26T08:04:36.4993547Z # shellcheck disable=SC2156 2024-06-26T08:04:36.4994358Z find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2024-06-26T08:04:36.5002317Z shell: /usr/bin/bash -e {0} 2024-06-26T08:04:36.5002669Z env: 2024-06-26T08:04:36.5003028Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:36.5003472Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:36.5004202Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:36.5005086Z ##[endgroup] 2024-06-26T08:04:36.6629229Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2024-06-26T08:04:36.6629819Z with: 2024-06-26T08:04:36.6630084Z env: 2024-06-26T08:04:36.6630369Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:36.6630831Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:36.6631610Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:36.6632288Z ##[endgroup] 2024-06-26T08:04:36.6652656Z ##[group]Run set -eou pipefail 2024-06-26T08:04:36.6653074Z set -eou pipefail 2024-06-26T08:04:36.6653436Z  2024-06-26T08:04:36.6653969Z echo "Holding runner for 2 hours until all ssh sessions have logged out" 2024-06-26T08:04:36.6654639Z for _ in $(seq 1440); do 2024-06-26T08:04:36.6655123Z  # Break if no ssh session exists anymore 2024-06-26T08:04:36.6655635Z  if [ "$(who)" = "" ]; then 2024-06-26T08:04:36.6656052Z  break 2024-06-26T08:04:36.6656403Z  fi 2024-06-26T08:04:36.6656707Z  echo "." 2024-06-26T08:04:36.6657045Z  sleep 5 2024-06-26T08:04:36.6657368Z done 2024-06-26T08:04:36.6664788Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:36.6665378Z env: 2024-06-26T08:04:36.6665654Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:36.6666103Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:36.6666852Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:36.6667507Z ##[endgroup] 2024-06-26T08:04:36.6687905Z Holding runner for 2 hours until all ssh sessions have logged out 2024-06-26T08:04:36.6725087Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2024-06-26T08:04:36.6725872Z # ignore expansion of "docker ps -q" since it could be empty 2024-06-26T08:04:36.6726477Z # shellcheck disable=SC2046 2024-06-26T08:04:36.6727062Z docker stop $(docker ps -q) || true 2024-06-26T08:04:36.6727547Z # Prune all of the docker images 2024-06-26T08:04:36.6728001Z docker system prune -af 2024-06-26T08:04:36.6734942Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:36.6735440Z env: 2024-06-26T08:04:36.6735718Z GIT_DEFAULT_BRANCH: main 2024-06-26T08:04:36.6736170Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2024-06-26T08:04:36.6736900Z DOCKER_CONTAINER_ID: 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:36.6737546Z ##[endgroup] 2024-06-26T08:04:37.1246411Z 2bc440a16935 2024-06-26T08:04:39.7022674Z Deleted Containers: 2024-06-26T08:04:39.7023382Z 2bc440a1693506f18cd5cd4bb30e53b48761fa5596d546d43118c14a72aff368 2024-06-26T08:04:39.7023921Z 2024-06-26T08:04:44.3400700Z Deleted Images: 2024-06-26T08:04:44.3402390Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9:91382da70d5719cd7007b6b80b71d2f48398f6b7 2024-06-26T08:04:44.3405268Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9@sha256:039956dde43d3becfe71079682edc6c7db7202c8395dd9e2a64a89dc944300d0 2024-06-26T08:04:44.3407229Z deleted: sha256:275441d577b6501c7c3f0a3f3b304d8adc9ce71705f2cbe57152c9a00c123a71 2024-06-26T08:04:44.3408491Z deleted: sha256:48ef1316cdead9adb2a1514faecc7f87b65cecf5074675bc556ddde116f4bdcb 2024-06-26T08:04:44.3409789Z deleted: sha256:527bafafb93a7d8c4bfa5832b9380883cec8d9348ebc39dd4c3b94af4d8db300 2024-06-26T08:04:44.3411013Z deleted: sha256:a8258a528ee3d90498b9dc928d13af45f0127c5eb63fb0615fafd7c7085092e8 2024-06-26T08:04:44.3411893Z deleted: sha256:0e947a4200a79ed51f0d4965893e33e9d718c6d48135338dd65aa9f8710c93fd 2024-06-26T08:04:44.3412925Z deleted: sha256:be801d06ff1758efafde42497fd7e6ce082d7e5563e310099936132036a02b31 2024-06-26T08:04:44.3413764Z deleted: sha256:c653b0fdb77d4538c77a540e3bf1f26742c1f2c83bb9aef881909e770193bf1b 2024-06-26T08:04:44.3414738Z deleted: sha256:dff7ac83f814f2f1cbd50d6a752a8908169e57a0548eaefa4320e025fc51e4aa 2024-06-26T08:04:44.3415602Z deleted: sha256:0797b5abaa21a4dea33f1d4e0bd5582780df4ebfbe193eabba4e016b310b5a33 2024-06-26T08:04:44.3416468Z deleted: sha256:c5a0bd86f89ddd7cbd6571031006004f25b99bf2a97873d820f6114299f95f84 2024-06-26T08:04:44.3417305Z deleted: sha256:84762fa180febb2e83cd02c4e40a05c60f2b008d5c0071654ddd08f7ba5f3652 2024-06-26T08:04:44.3418140Z deleted: sha256:25ce870269567e88b87af490736c6e5ccc32ffccd95f5dd277924b02717778e4 2024-06-26T08:04:44.3418985Z deleted: sha256:c6a01862d4b36484a7f23a2dca4d1d2d4c1ab08eabdc8f97329889079d283e30 2024-06-26T08:04:44.3419829Z deleted: sha256:fd6e7b149a4eb8518da5e13e47ea26df1cb57a58daef252905c68c80a1ddb40e 2024-06-26T08:04:44.3420687Z deleted: sha256:925ec5095811af3ced060a3c017d96ebb148320b9e85b8bce1c4be5872f9d349 2024-06-26T08:04:44.3421591Z deleted: sha256:ed5e05eb6ec533285d28a13f0f914e47365e2f514d7451a524fb84639562c2a6 2024-06-26T08:04:44.3422569Z deleted: sha256:f22bb471fb2f12b58d3c58cf8a4370424422c854d3e0e81c0f05c80b30c29d0c 2024-06-26T08:04:44.3423552Z deleted: sha256:f40fb8a3eca4250fcff474a07a96bdb46abdd34120555670e55427d9cc824bee 2024-06-26T08:04:44.3424400Z deleted: sha256:ac27f68e58d1bac6f2126d86a42b76b91bc8f6665fa38c028811641e3fbb4374 2024-06-26T08:04:44.3425373Z deleted: sha256:e91e5f9d0fe885b286c798da077d0e6f54da36295848268be3b600c7b8d4b58b 2024-06-26T08:04:44.3426218Z deleted: sha256:6df9d5aba8ad30bba22f79490df60183d6256f7bdd4f14c223ed93ee70736ce0 2024-06-26T08:04:44.3427073Z deleted: sha256:b1077f7a9b0b62132e60e0236dc6da5b7cfbe7d7d938d95bd857e490dc609d11 2024-06-26T08:04:44.3427923Z deleted: sha256:5f1f70f2e2d5c67f0d693e39bbf65959a331af08ae4b2908f449decdef7d95fc 2024-06-26T08:04:44.3428795Z deleted: sha256:816091d82a4d43a1925b0b9ba136c3c091289d1ff175b9271c6f4608130aa5d3 2024-06-26T08:04:44.3429628Z deleted: sha256:e9157a43ffe7ee42d7dc341528ca13572ea01dc7f579faae67a95990b48d69a3 2024-06-26T08:04:44.3430447Z deleted: sha256:59d54909386f7f5683dd32073531866c3b614c96b467b49c28b20e57ee021e5b 2024-06-26T08:04:44.3431359Z deleted: sha256:0654690ba2334b9ded01bb936df191a0871c75343324177ee31cc624e3d16afe 2024-06-26T08:04:44.3432195Z deleted: sha256:32b433821dc72d54a1f3585b1db119bf587b9a71e732cc4734c879532ec7520a 2024-06-26T08:04:44.3433044Z deleted: sha256:68cfdd0cbe2ab8d0490c8c5c60078c5890f2cee220f0b09d103e5d50e85cab19 2024-06-26T08:04:44.3433910Z deleted: sha256:b9bb69ac7eb95de9e9db7e756f78d4a8b5f38249e1ee3d98948f702d2064e130 2024-06-26T08:04:44.3434770Z deleted: sha256:9be36668e6bdc3ae8714a8ebc0232760469a0548ba4c42df21aefc2781126349 2024-06-26T08:04:44.3435639Z deleted: sha256:4d71cd29e89fec4cfd5d99ee604fd2d2cfe109caa9475fb7adbcf994d66591fa 2024-06-26T08:04:44.3436589Z deleted: sha256:140a23260239b942e7db87d4b474a7061e7d1a17b6a45217d3f6bdb6c0eafa15 2024-06-26T08:04:44.3437468Z deleted: sha256:dea1ac0ed17850404c5d7edad592904fac943ebb15fb854324f04c7a5fe1e26d 2024-06-26T08:04:44.3438311Z deleted: sha256:7e06a6998240afcce0855f0a1c3aed4f38f8e77ae9f054503d2beeb6bb997022 2024-06-26T08:04:44.3439162Z deleted: sha256:d738d0217343616eca2354f4ec3ba58983a73bdae8afd1a9f92cfa1235154833 2024-06-26T08:04:44.3440013Z deleted: sha256:12a3e2812fd74f530cb07c1bfc050e74c2ee21d9f0885d9afe47e15064aab1ca 2024-06-26T08:04:44.3440881Z deleted: sha256:e9f2e9dbba771327ae0974836faadb351a8ae45ccb498d5b8b6dfaf5a848ebf6 2024-06-26T08:04:44.3441734Z deleted: sha256:ae5f08639fcf17b8debf2327add62d8487137e5e9e9b2743afc2894aa02f3356 2024-06-26T08:04:44.3442625Z deleted: sha256:22d1e57a2f0d21a683b498830bdc88f95a3e783d984b49d817bab851bcb72098 2024-06-26T08:04:44.3443460Z deleted: sha256:5f98e54f0bd94e0dbb86c9a2a23cbf54370b85467ab5a46e48a816cb9e5e80ce 2024-06-26T08:04:44.3444298Z deleted: sha256:c6f320ba5799d4be66e06040946c6fbbaa9d687c96c26851b24be99e53c667b3 2024-06-26T08:04:44.3445395Z deleted: sha256:02de20218260064602fc2b0e6db57b7da6529a08323a4736ede72b7515f1dfd6 2024-06-26T08:04:44.3446210Z deleted: sha256:aae23064be470a2431950e48d85463d5ad200708d6243dc173684bc51b87f39b 2024-06-26T08:04:44.3447139Z deleted: sha256:e6ea83a05504e6f708bfd4eb9825fa38622178939b8d6c2d6a37a9640fb179f4 2024-06-26T08:04:44.3447982Z deleted: sha256:996cf892d3122dadac3cbdcf5ec1099333975b8fbe5c3434024c85496b05f136 2024-06-26T08:04:44.3448814Z deleted: sha256:ec069cecf8e032884ae3963eba8760db966d51d852195239c7ee7b7239d8261d 2024-06-26T08:04:44.3449632Z deleted: sha256:b74817a482521e7b6969bb14c3e42e93ba8460d116796ec1ed546374b9906657 2024-06-26T08:04:44.3450430Z deleted: sha256:dc244a86208996bc41077c078d7077654c63f379291594c1eaddd9950dc0f39d 2024-06-26T08:04:44.3451239Z deleted: sha256:b04224594db4b0368e7d2fc8091f207679a4cc15c4390b737081e4cb1ce84ea6 2024-06-26T08:04:44.3452048Z deleted: sha256:f2f896483269235206983d88fe7df7c4a85c0f4b4ae5847016689ea4ebde88fe 2024-06-26T08:04:44.3452869Z deleted: sha256:e6b1c321b72d19fe51cb6cc830d0f70ede34c1f16a4f68e2f1875318ac6b245c 2024-06-26T08:04:44.3453706Z deleted: sha256:3ad33b8473e2326bc77fde118e111afee569f038e280e1fab0c05b078b352402 2024-06-26T08:04:44.3454542Z deleted: sha256:91b568585679d77ca154afeb9ccf3b822ce0e433d6f9c91e7d8171ab0f8d1586 2024-06-26T08:04:44.3455366Z deleted: sha256:6e0a11cb1bc362c11171555cfabb9294f0d03f1b39f9c9fe206fb2949c52a4c4 2024-06-26T08:04:44.3456201Z deleted: sha256:100f27f8a5b6b523b486c2e1f4865fe7f19e12ee67d4ca9221f7f5ba0e1bb050 2024-06-26T08:04:44.3457029Z deleted: sha256:eb966417e6e550953cd8a788a4fda57a309890cd956c000f5c49290fc804be8c 2024-06-26T08:04:44.3457869Z deleted: sha256:12e4c64468492ae61c7e7bed6b81fc70a90d288e554d2490cb025945c5ec78a3 2024-06-26T08:04:44.3458707Z deleted: sha256:f24177d71a5854a4a1fafb7b56f4d89e5be5d7c93ab9ec5078fdcb2573a9e1d1 2024-06-26T08:04:44.3459549Z deleted: sha256:3cd559ed7b4e2ab7e1148b7e9392691e64c3de5b42d11668036c71596b33fc44 2024-06-26T08:04:44.3460396Z deleted: sha256:feee8cd1a53ce1188300dca77be8f02c4cd49f93170acb633fd261dfc054c3fc 2024-06-26T08:04:44.3461241Z deleted: sha256:d3b76c10044f1477fa75b5fd9cefecf32fea66b8d08438c3266d0b154e0715df 2024-06-26T08:04:44.3462063Z deleted: sha256:914270edd2e25e6367e3d4d06845624243506b8ab95628b5011a6078ac4c9dcc 2024-06-26T08:04:44.3462920Z deleted: sha256:151628b41f95cf388c4067d4774eeed64e02da4253a2908b6383e7235eacb615 2024-06-26T08:04:44.3463723Z deleted: sha256:8520f9985a0935bac39e98a001e628241a2b7ebd6094ba8033f037950f7215d4 2024-06-26T08:04:44.3464563Z deleted: sha256:bc8e25be5ef99fe08c933537c6e3dc3ef3ccd7e496b47eb4b27c29bfab56ee34 2024-06-26T08:04:44.3465513Z deleted: sha256:a4aebc01e1e5c72dc89f8db2cab16ff53459395b4b8018928813b6213afbcf6b 2024-06-26T08:04:44.3466348Z deleted: sha256:c75ab46e74d5588dd8a5c353a36a814374a88c31a0ae9ad4545818f0faa78273 2024-06-26T08:04:44.3467181Z deleted: sha256:35a033279d5c9bd6b862bfa9331d8fcbf98bbaf778a778fabc882384db9204f8 2024-06-26T08:04:44.3468021Z deleted: sha256:ab9b00c496073d62e69e032178858d3d6d4c4bab9e87065d3785c6da57351d00 2024-06-26T08:04:44.3468846Z deleted: sha256:7ed947299e109c2459de6e240d86f049c30f93f2380659ce441d8737b8cd065f 2024-06-26T08:04:44.3469673Z deleted: sha256:410d5a7f7a9a1cb4551c106b9cc728a9dcff598f9be231a351ce6a0a33f81e64 2024-06-26T08:04:44.3470522Z deleted: sha256:5f0021bb56efa14bb93978c01513e2a1187ba30e69bdb0546d0ef39b30873f88 2024-06-26T08:04:44.3471357Z deleted: sha256:c1601aa97eb84151c14da3aeea351201bd99144d36d66b397f6555d80245d86d 2024-06-26T08:04:44.3472254Z deleted: sha256:72af05a89be22accdc1ca5d66dcbbb33993a9ef5997f849df1d4ba4c48049f25 2024-06-26T08:04:44.3473108Z deleted: sha256:38b90c9663dcaa2bc57a4dd3008298e7ea93e9535fa42312b4dc4246e7491af9 2024-06-26T08:04:44.3473969Z deleted: sha256:bda61b6cefb3ec8eeb74fef1ca1c7f9a5845fe5e8f07b8123323e897425d5c29 2024-06-26T08:04:44.3474829Z deleted: sha256:5a18e1aa877074529a84cbddf19f8d5403787823378ceae6b72fb62f78d43037 2024-06-26T08:04:44.3475656Z deleted: sha256:6c3e7df31590f02f10cb71fc4eb27653e9b428df2e6e5421a455b062bd2e39f9 2024-06-26T08:04:44.3476220Z 2024-06-26T08:04:44.3476367Z Total reclaimed space: 25.48GB 2024-06-26T08:04:44.3542373Z Post job cleanup. 2024-06-26T08:04:44.3595130Z Post job cleanup. 2024-06-26T08:04:44.4409065Z [command]/usr/bin/git version 2024-06-26T08:04:44.4452138Z git version 2.40.1 2024-06-26T08:04:44.4491764Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2a3648b2-3192-4ba4-9628-331c990202bf' before making global git config changes 2024-06-26T08:04:44.4493017Z Adding repository directory to the temporary git global config as a safe directory 2024-06-26T08:04:44.4497479Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2024-06-26T08:04:44.4529762Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2024-06-26T08:04:44.4552790Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2024-06-26T08:04:44.4777917Z Entering 'android/libs/fbjni' 2024-06-26T08:04:44.4815154Z Entering 'third_party/FP16' 2024-06-26T08:04:44.4852595Z Entering 'third_party/FXdiv' 2024-06-26T08:04:44.4888684Z Entering 'third_party/NNPACK' 2024-06-26T08:04:44.4925998Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T08:04:44.4959611Z Entering 'third_party/XNNPACK' 2024-06-26T08:04:44.5006348Z Entering 'third_party/benchmark' 2024-06-26T08:04:44.5038940Z Entering 'third_party/cpp-httplib' 2024-06-26T08:04:44.5075608Z Entering 'third_party/cpuinfo' 2024-06-26T08:04:44.5111869Z Entering 'third_party/cudnn_frontend' 2024-06-26T08:04:44.5149357Z Entering 'third_party/cutlass' 2024-06-26T08:04:44.5189088Z Entering 'third_party/eigen' 2024-06-26T08:04:44.5226586Z Entering 'third_party/fbgemm' 2024-06-26T08:04:44.5263570Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T08:04:44.5294219Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T08:04:44.5328854Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T08:04:44.5365660Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T08:04:44.5401108Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T08:04:44.5438056Z Entering 'third_party/flatbuffers' 2024-06-26T08:04:44.5476547Z Entering 'third_party/fmt' 2024-06-26T08:04:44.5511404Z Entering 'third_party/foxi' 2024-06-26T08:04:44.5547200Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T08:04:44.5582876Z Entering 'third_party/gloo' 2024-06-26T08:04:44.5618931Z Entering 'third_party/googletest' 2024-06-26T08:04:44.5651132Z Entering 'third_party/ideep' 2024-06-26T08:04:44.5683030Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T08:04:44.5722561Z Entering 'third_party/ittapi' 2024-06-26T08:04:44.5758511Z Entering 'third_party/kineto' 2024-06-26T08:04:44.5793291Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T08:04:44.5828658Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T08:04:44.5864695Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T08:04:44.5895300Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T08:04:44.5930302Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T08:04:44.5962585Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T08:04:44.5999690Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T08:04:44.6032476Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T08:04:44.6067406Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T08:04:44.6100928Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T08:04:44.6137005Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T08:04:44.6171953Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T08:04:44.6207569Z Entering 'third_party/mimalloc' 2024-06-26T08:04:44.6242618Z Entering 'third_party/nccl/nccl' 2024-06-26T08:04:44.6279230Z Entering 'third_party/nlohmann' 2024-06-26T08:04:44.6317298Z Entering 'third_party/onnx' 2024-06-26T08:04:44.6364549Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T08:04:44.6399959Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T08:04:44.6439020Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T08:04:44.6476723Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T08:04:44.6511917Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T08:04:44.6544987Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T08:04:44.6592579Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T08:04:44.6612622Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T08:04:44.6648011Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T08:04:44.6681667Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T08:04:44.6713141Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T08:04:44.6748968Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T08:04:44.6783937Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T08:04:44.6833298Z Entering 'third_party/pocketfft' 2024-06-26T08:04:44.6870616Z Entering 'third_party/protobuf' 2024-06-26T08:04:44.6905531Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T08:04:44.6937853Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T08:04:44.6974283Z Entering 'third_party/psimd' 2024-06-26T08:04:44.7007889Z Entering 'third_party/pthreadpool' 2024-06-26T08:04:44.7040580Z Entering 'third_party/pybind11' 2024-06-26T08:04:44.7075517Z Entering 'third_party/python-peachpy' 2024-06-26T08:04:44.7109510Z Entering 'third_party/sleef' 2024-06-26T08:04:44.7144403Z Entering 'third_party/tensorpipe' 2024-06-26T08:04:44.7179957Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T08:04:44.7214733Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T08:04:44.7248438Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T08:04:44.7282832Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T08:04:44.7315885Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T08:04:44.7366428Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2024-06-26T08:04:44.7384467Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7392665Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2024-06-26T08:04:44.7417561Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2024-06-26T08:04:44.7623019Z Entering 'android/libs/fbjni' 2024-06-26T08:04:44.7644965Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7668750Z Entering 'third_party/FP16' 2024-06-26T08:04:44.7690565Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7712034Z Entering 'third_party/FXdiv' 2024-06-26T08:04:44.7733778Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7756773Z Entering 'third_party/NNPACK' 2024-06-26T08:04:44.7777688Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7800788Z Entering 'third_party/VulkanMemoryAllocator' 2024-06-26T08:04:44.7822391Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7845615Z Entering 'third_party/XNNPACK' 2024-06-26T08:04:44.7867348Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7902483Z Entering 'third_party/benchmark' 2024-06-26T08:04:44.7924555Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7948880Z Entering 'third_party/cpp-httplib' 2024-06-26T08:04:44.7969573Z http.https://github.com/.extraheader 2024-06-26T08:04:44.7993249Z Entering 'third_party/cpuinfo' 2024-06-26T08:04:44.8015947Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8040538Z Entering 'third_party/cudnn_frontend' 2024-06-26T08:04:44.8062422Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8086548Z Entering 'third_party/cutlass' 2024-06-26T08:04:44.8107676Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8136236Z Entering 'third_party/eigen' 2024-06-26T08:04:44.8158754Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8185665Z Entering 'third_party/fbgemm' 2024-06-26T08:04:44.8208171Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8232487Z Entering 'third_party/fbgemm/third_party/asmjit' 2024-06-26T08:04:44.8254899Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8279466Z Entering 'third_party/fbgemm/third_party/cpuinfo' 2024-06-26T08:04:44.8301191Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8325410Z Entering 'third_party/fbgemm/third_party/cutlass' 2024-06-26T08:04:44.8346961Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8375186Z Entering 'third_party/fbgemm/third_party/googletest' 2024-06-26T08:04:44.8395289Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8419821Z Entering 'third_party/fbgemm/third_party/hipify_torch' 2024-06-26T08:04:44.8440442Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8463970Z Entering 'third_party/flatbuffers' 2024-06-26T08:04:44.8487802Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8513891Z Entering 'third_party/fmt' 2024-06-26T08:04:44.8535746Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8559177Z Entering 'third_party/foxi' 2024-06-26T08:04:44.8582079Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8606436Z Entering 'third_party/gemmlowp/gemmlowp' 2024-06-26T08:04:44.8627218Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8651932Z Entering 'third_party/gloo' 2024-06-26T08:04:44.8673904Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8697671Z Entering 'third_party/googletest' 2024-06-26T08:04:44.8719944Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8743676Z Entering 'third_party/ideep' 2024-06-26T08:04:44.8766900Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8790053Z Entering 'third_party/ideep/mkl-dnn' 2024-06-26T08:04:44.8812045Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8839663Z Entering 'third_party/ittapi' 2024-06-26T08:04:44.8862569Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8887365Z Entering 'third_party/kineto' 2024-06-26T08:04:44.8909615Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8933645Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2024-06-26T08:04:44.8954735Z http.https://github.com/.extraheader 2024-06-26T08:04:44.8978340Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2024-06-26T08:04:44.9000344Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9024564Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2024-06-26T08:04:44.9047203Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9069718Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2024-06-26T08:04:44.9090942Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9115066Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2024-06-26T08:04:44.9136219Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9156922Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2024-06-26T08:04:44.9178944Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9203464Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2024-06-26T08:04:44.9225420Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9248967Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2024-06-26T08:04:44.9270967Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9294869Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2024-06-26T08:04:44.9315964Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9339384Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2024-06-26T08:04:44.9360540Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9385769Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2024-06-26T08:04:44.9407313Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9430370Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2024-06-26T08:04:44.9451613Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9477174Z Entering 'third_party/mimalloc' 2024-06-26T08:04:44.9500189Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9523544Z Entering 'third_party/nccl/nccl' 2024-06-26T08:04:44.9545015Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9568168Z Entering 'third_party/nlohmann' 2024-06-26T08:04:44.9590015Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9612881Z Entering 'third_party/onnx' 2024-06-26T08:04:44.9635298Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9669818Z Entering 'third_party/onnx/third_party/benchmark' 2024-06-26T08:04:44.9691374Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9714958Z Entering 'third_party/onnx/third_party/pybind11' 2024-06-26T08:04:44.9738050Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9762577Z Entering 'third_party/opentelemetry-cpp' 2024-06-26T08:04:44.9786128Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9812940Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2024-06-26T08:04:44.9834453Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9858202Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2024-06-26T08:04:44.9879734Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9902790Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2024-06-26T08:04:44.9924349Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9948708Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2024-06-26T08:04:44.9969668Z http.https://github.com/.extraheader 2024-06-26T08:04:44.9993777Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2024-06-26T08:04:45.0015614Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0039693Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2024-06-26T08:04:45.0061708Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0084934Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2024-06-26T08:04:45.0106218Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0129002Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2024-06-26T08:04:45.0150968Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0174255Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2024-06-26T08:04:45.0195369Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0219864Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2024-06-26T08:04:45.0241742Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0279655Z Entering 'third_party/pocketfft' 2024-06-26T08:04:45.0302857Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0328298Z Entering 'third_party/protobuf' 2024-06-26T08:04:45.0348432Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0372092Z Entering 'third_party/protobuf/third_party/benchmark' 2024-06-26T08:04:45.0393821Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0415082Z Entering 'third_party/protobuf/third_party/googletest' 2024-06-26T08:04:45.0436466Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0462330Z Entering 'third_party/psimd' 2024-06-26T08:04:45.0484363Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0507839Z Entering 'third_party/pthreadpool' 2024-06-26T08:04:45.0528273Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0552636Z Entering 'third_party/pybind11' 2024-06-26T08:04:45.0573092Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0595968Z Entering 'third_party/python-peachpy' 2024-06-26T08:04:45.0619169Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0640944Z Entering 'third_party/sleef' 2024-06-26T08:04:45.0663562Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0686945Z Entering 'third_party/tensorpipe' 2024-06-26T08:04:45.0708190Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0731086Z Entering 'third_party/tensorpipe/third_party/googletest' 2024-06-26T08:04:45.0752882Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0774214Z Entering 'third_party/tensorpipe/third_party/libnop' 2024-06-26T08:04:45.0795214Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0817905Z Entering 'third_party/tensorpipe/third_party/libuv' 2024-06-26T08:04:45.0839157Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0861777Z Entering 'third_party/tensorpipe/third_party/pybind11' 2024-06-26T08:04:45.0881244Z http.https://github.com/.extraheader 2024-06-26T08:04:45.0905396Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2024-06-26T08:04:45.0927351Z http.https://github.com/.extraheader 2024-06-26T08:04:45.1026312Z A job completed hook has been configured by the self-hosted runner administrator 2024-06-26T08:04:45.1046116Z ##[group]Run '/home/ec2-user/runner-scripts/cleanup.sh' 2024-06-26T08:04:45.1052747Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2024-06-26T08:04:45.1053271Z ##[endgroup] 2024-06-26T08:04:45.7665319Z Cleaning up orphan processes